Reddit 當初對 Google 搜尋引擎的客製化設計

Hacker News 上看到的討論,裡面剛好有 Reddit 第一個雇用的工程師,Jeremy Edberg 的留言:「Reddit is taking over Google (businessinsider.com)」。

id=40068381 這邊提到了不少東西,首先是把 title 放進 url 裡面的作法:

To this day, my most public contribution to reddit is that I wrote the code to put the title of the post in the URL. That was done specifically for SEO purposes.

這個在 Google Webmasters (現在叫做 Google Search Console) 也針對 Reddit 處理,將速率強制設為 Custom,不讓 Reddit 的人改:

It was pretty much the only SEO optimization we ever did (along with a few DOM changes), because shortly after that, Google basically dedicated engineering effort specifically to crawling reddit. So much so that we lost the "crawl rate" button in our SEO admin page on Google, it was just set to "Custom".

後續還要針對 Google 的抓取在 load balancer 上把流量拆開處理,不然 crawling pattern 與一般使用情境很不一樣,會造成 cache 的效率極度低落:

I had to stand up a fleet of app servers and caches and databases, and change the load balancers so that Google basically had their own infrastructure (although we would shunt all crawlers there). Crawler traffic was very different than regular traffic -- it looked at pages more than two days old, something humans rarely did at the time. It would blow out every cache (memory, database, disk, etc.). So we put them on their own infra to keep them from killing the rest of the site.

這些算是頗有趣的經驗?

備份 Xuite Blog 的公開文章

中華的 Xuite 前陣子宣佈了服務中止的公告:「Xuite隨意窩平台服務終止公告」(這邊就先拉 Internet Archive 的連結了,看起來之後會消失...)。

Blog 的部份,除了作者本身可以拉資料下來放到其他平台以外,外人也可以把這些歷史遺跡保留下來,像是丟到 Internet Archive 的 Wayback Machine 上面。

所以用 Perl 寫了一隻 script,把 url 掃出來後,後續就可以用其他工具 submit 到 Wayback Machine 上面:「xuite-urldump」。

當年有不少 ACG 相關的 blog 在上面,先來備份起來...

Kagi 開發的 Universal Summarizer

在「Universal Summarizer (kagi.com)」這邊看到的新服務,可以給出某個網址的 summary。服務的本體則是在「Kagi - Universal Summarizer」這邊,從網址可以猜測是 Kagi 的實驗項目。

像是 CNN 的「Biden’s dramatic warning to China」這篇,他抓出對應的 summary 看起來沒什麼問題:

President Joe Biden delivered a dramatic warning to China in his State of the Union address, vowing to protect America's sovereignty if China threatens it. He specifically named President Xi Jinping, saying "Name me a world leader who'd change places with Xi Jinping. Name me one!" This marked a stark escalation in the US-Chinese relationship, which has been strained by a balloon surveillance program and other issues. Biden also addressed Russia, calling their invasion of Ukraine a test for America and the world. His speech highlighted the unified opposition to China in US politics, with House Speaker Kevin McCarthy having convened a bipartisan House committee to examine the perceived threat from the Chinese Communist Party. Biden's comments also served as an important milestone in the increasingly tumultuous competition between the US and China, as the US shifts to talking about establishing guardrails for the relationship and protecting the Western-led rules-based international system.

這個功能可以用 GPT-3.5 或是 ChatGPT 串,但不確定 Kagi 是串上去還自己搞?我猜有蠻大的機會是串的...

然後他的網址設計因為是 url 傳遞的方式,可以包裝成 bookmarklet 形式放在快速列上面用:「Kagi Universal Summarizer」。

另外在 HN 的討論裡面看到很厲害的用法,把 https://bellard.org/quickjs/pi_bigdecimal.js 這段 javascript code 的 url 丟進去,然後居然出現了程式碼的說明,居然還正確判斷出是 Chudnovsky algorithm

This code uses the QuickJS bigdecimal type to calculate the value of pi to a given precision. It does this by using the Chudnovsky algorithm, which is a series of calculations that can be used to approximate pi. The code is written in Javascript and uses BigInt and BigDecimal to perform the calculations. It is interesting to note that the code also takes into account the possibility of bad rounding for the last digits, and adds extra digits to reduce the probability of this happening.

既然包了一個 bookmarklet,最近應該會常常拿出來用...

微軟的 Outlook 系統會自動點擊信件內的連結

前幾天在 Hacker News Daily 上翻到的,微軟的 Outlook 系統 (雲端上的系統) 會自動點擊信件內的連結,導致一堆問題:「“Magic links” can end up in Bing search results — rendering them useless.」,在 Hacker News 上的討論也有很多受害者出來抱怨:「“Magic links” can end up in Bing search results, rendering them useless (medium.com/ryanbadger)」。

原文的標題寫的更批評,指控 Outlook 會把這些 link 丟到 Bing 裡面 index,這點還沒有看到確切的證據。

先回到連結被點擊的問題,照文章內引用的資料來看,看起來是 2017 年開始就有的情況:「Do any common email clients pre-fetch links rather than images?」。

As of Feb 2017 Outlook (https://outlook.live.com/) scans emails arriving in your inbox and it sends all found URLs to Bing, to be indexed by Bing crawler.

在 Hacker News 上的討論也提到了像是 one-time login email 的機制也會因此受到影響,被迫要用比較費工夫的方法讓使用者登入 (像是給使用者 one-time code 輸入,而不是點 link 就可以登入)。

先記起來,以後在設計時應該會遇到,要重新思考 threat model...

用 uBlock Origin 過濾 URL 裡面的 tracking parameter

在「ClearURLs – automatically remove tracking elements from URLs (github.com/clearurls)」這邊的討論裡面看到 gorhill (uBlock Origin 的作者) 的回文,裡面提到了 uBlock Origin 目前也有支援 removeparam 了,而且有對應的 filter list 在維護這個表格:

不過他也有提到 CleanURLs 可以清更多東西:

Addendum: to be clear, this is not a replacement for ClearURLs. ClearURLs has more capabilities then just removing query parameters from the URLs of outgoing network requests.

但這樣起來也不錯了 (尤其是對於只裝 uBlock Origin 的情況下),可以訂起來...

Google Chrome 要推動預設使用 HTTPS 連線

Google Chrome 從 90 版要把 https:// 變成網址輸入時的預設值:「A safer default for navigation: HTTPS」。Hacker News 上的「Chrome’s address bar will use https:// by default (chromium.org)」也可以看一下。

也就是說,沒有輸入 schema 的網址,預設會用 https:// 方式連線,以往這點需要透過 HTTPS Everywhere 這種套件,然後開啟「Encrypt All Sites Eligible」這樣的參數,像是這樣:

這樣會再推一把...

Google Chrome 的顯示完整 URL 的方式又改了...

剛剛更新 Google Chrome 後發現 address bar 又被動手腳了 (我是 beta channel 的版本),在網址列上的 Protocol (Scheme) 與 www subdomain 不見了,而本來裝 Suspicious Site Reporter 可以讓 URL bar 恢復的方式也失效了。

翻了一下老問題「Chrome address bar no longer shows protocol or www subdomain」,裡面有提到新的解法 (他寫的當下) 是「The current solution in Chrome 83+」這個,直接在網址列上選右鍵,可以看到 Always show full URLs 這個選項,這是整個 Google Chrome 的選項,而非單一站台選項,選下去後就生效了:

而本來的 Suspicious Site Reporter 也可以拔掉了 (沒有用處了)。

Google Chrome 又要對 URL 搞事了...

上個禮拜鬧的頗兇的話題,Google Chrome 又要對 URL 搞事了:「Google resumes its senseless attack on the URL bar, hides full addresses on Chrome 85」。

這次是打算把整個 path 藏起來:

A few new feature flags have appeared in Chrome's Dev and Canary channels (V85), which modify the appearance and behavior of web addresses in the address bar. The main flag is called "Omnibox UI Hide Steady-State URL Path, Query, and Ref" which hides everything in the current web address except the domain name. For example, "https://www.androidpolice.com/2020/06/07/lenovo-ideapad-flex-5-chromebook-review/" is simply displayed as "androidpolice.com."

來繼續等 3rd-party browser 成熟...

Chrome 打算要終止支援 FTP 協定

從「Google plans to deprecate FTP URL support in Chrome」這邊看到的,狀態資訊可以在「Deprecate FTP support」這邊看到。

以目前的 timeline 資訊,看起來是 M82 版本會完全拔掉:

M78 (2019Q4)
Finch controlled flag and enterprise policy for controlling overall FTP support.

Support disabled on pre-release channels.

M80 (2020Q1)
Gradual turndown of FTP support on stable.

M82 (2020Q2)
Removal of FTP related code and resources.

不過這樣就沒有方便的 FTP downloader 了 (雖然不常見),得另外再找軟體下載...

把 HTML 資料都塞到 URL 裡的服務

如標題說的,專案叫做「URL Pages」,是一個把 HTML 的原始碼都放到 url string 裡的服務。服務本身可以在 http://jstrieb.github.io/urlpages 這邊使用。

現在網路上有很多地方可以塞 demo page (像是 GitHub 自己提供的 GitHub Pages),這個方式需要依賴 JavaScript 提供各種轉換機制,應該會有不少 side-effect bug,看起來會有不少技術瓶頸,就當作是個實驗性質的專案來看,真的需要放 demo page 的還是可以隨便找空間放比較簡單,像是前面提到的 GitHub Pages 其實可以直接在 GitHub 的網頁上操作...