Enforcing a Single Standard Format for BUILD Files at Google in 2012

Saw "Reformatting 100k Files at Google in 2011 (le-brun.eu)" on Hacker News; the original article is "The Story of Reformatting 100k Files at Google in 2012".

Besides what the author wrote, Russ Cox also shared some background from that time on Hacker News.

The story: in 2012 the author, Laurent Le Brun (LinkedIn), was in Germany, had just joined Google, and was working on Google's internal build tool Blaze (known externally as Bazel):

Back in September 2012, I was a junior engineer at Google, working on Bazel (Google’s build tool, also known internally as Blaze).

Then he and his team lead received a meeting invite from two engineers on a US team:

One day, a mysterious calendar invite landed in my inbox. It was sent by two engineers in the US, and I was invited along with my team lead.

Those two turned out to be Rob Pike and Russ Cox:

I quickly recognized the names: Rob Pike and Russ Cox. Though I hadn't worked with them, I knew them by reputation: Russ Cox because I enjoyed reading his blog posts, and Rob Pike because, well… he’s famous.

The author doesn't bother introducing Rob Pike here; after all, he led Plan 9 at Bell Labs back in the day, is one of the inventors of UTF-8, and later headed up Go (the programming language) inside Google... incidentally, the Go mascot was drawn by his wife.

The two of them wanted to unify the format of every BUILD file across the company:

During the meeting, Rob and Russ shared their ambitious plan: to reformat every single Bazel BUILD file in Google’s codebase and enforce this formatting with a presubmit script.

The article covers some of the engineering details of how it was done, but the parts that matter most to me are these. First, Russ Cox mentions something he learned on the Go team: (in a large company) don't let engineers waste time on problems like formatting:

The formatting issue is a problem that should not exist and that we care enough about to spend our own engineering time eliminating it. I understand that it does not seem a priori that automated formatting would make a significant difference. I didn't truly understand it myself until we eliminated it from the Go code review and code editing processes. Hopefully in six months or a year, when everything is converted and we look back at this, the benefits will be clearer in retrospect.

Also, in the update section the author mentions the decision to do it all at once and fix every BUILD file up front, rather than only enforcing the format when people later touched a file; this avoids leaving the next person with a huge diff that is hard to review:

Why not enable the new formatter without updating all the files? This would lead to long diffs whenever someone modifies a file, making the change hard to review. The cost would be spread over all the engineers in the company.
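As a rough illustration of what "enforce this formatting with a presubmit script" can look like, here is a hedged sketch (mine, not Google's actual presubmit), assuming buildifier, the open-source BUILD formatter, is installed:

import subprocess
import sys

def check_build_files(paths):
    # "buildifier --mode=check" exits non-zero when any of the given
    # BUILD files is not already in canonical format
    result = subprocess.run(["buildifier", "--mode=check", *paths])
    return result.returncode

if __name__ == "__main__":
    # e.g. pass the BUILD files touched by the change under review
    sys.exit(check_build_files(sys.argv[1:]))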

Google Public DNS Complies with French Court Blocking Orders

看到「Google, Cloudflare & Cisco Will Poison DNS to Stop Piracy Block Circumvention」這篇,法國在 2022 年通過的體育法律反過來干涉 ISP 或是服務提供商需要配合阻擋:

Tampering with public DNS is a step too far for many internet advocates but for major rightsholders, if the law can be shaped to allow it, that’s what will happen. In this case, Article L333-10 of the French Sports Code (active Jan 2022) seems capable of accommodating almost anything.

Testing with footybite.cc (mentioned in the article) from a Vultr VPS spun up in France against the various public DNS services, it looks like Google Public DNS has already implemented this, and it returns EDE 16 from RFC 8914: Extended DNS Errors:

$ dig footybite.cc @8.8.8.8

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
; EDE: 16 (Censored): (The requested domain is on a court ordered copyright piracy blocklist for FR (ISO country code). To learn more about this specific removal, please visit https://lumendatabase.org/notices/41606068.)
;; QUESTION SECTION:
;footybite.cc.                  IN      A

For now, 1.1.1.1 (Cloudflare), 9.9.9.9 (Quad9), and 208.67.222.222 (OpenDNS) don't appear to be blocking it yet.
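If you want to repeat the same check programmatically, a hedged sketch using dnspython (EDE parsing landed around version 2.2); the resolver list and domain are from above, everything else is just a quick test harness:

import dns.edns
import dns.message
import dns.query
import dns.rcode

RESOLVERS = {
    "Google": "8.8.8.8",
    "Cloudflare": "1.1.1.1",
    "Quad9": "9.9.9.9",
    "OpenDNS": "208.67.222.222",
}

def check(domain):
    for name, server in RESOLVERS.items():
        # send the query with EDNS enabled so the resolver can attach EDE options
        query = dns.message.make_query(domain, "A", use_edns=0)
        response = dns.query.udp(query, server, timeout=5)
        ede = [o for o in response.options if isinstance(o, dns.edns.EDEOption)]
        if ede:
            print(f"{name}: EDE {int(ede[0].code)} ({ede[0].text})")
        else:
            print(f"{name}: no EDE, rcode={dns.rcode.to_text(response.rcode())}")

check("footybite.cc")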

Also, from actual testing, running your own Unbound resolver seems to be enough to get around it; I wonder whether they will demand more later, like filtering DNS directly on the internet backbone? (Maybe the earlier push for DNS over TLS and DNS over HTTPS will finally pay off?)

The other thing to watch is whether Cloudflare and the other public DNS services push back...

Google Disabled a Large Number of Accounts Tied to China and Russia

Saw this in "Google Takes Down Influence Campaigns Tied to China, Indonesia, and Russia"; Google's own write-up is in "TAG Bulletin: Q2 2024". It looks like a routine update?

The part relevant to Taiwan is, of course, the China-linked operations, which also account for the most terminated accounts; at the end of the report, Google mentions finding over a thousand propaganda accounts on YouTube and Blogger tied to the Chinese government:

We terminated 1,320 YouTube channels and 1,177 Blogger blogs as part of our ongoing investigation into coordinated influence operations linked to the People’s Republic of China (PRC). The coordinated inauthentic network uploaded content in Chinese and English about China and U.S. foreign affairs. These findings are consistent with our previous reports.

The second largest was Russia:

We terminated 378 YouTube channels as part of our investigation into coordinated influence operations linked to Russia. The campaign was linked to a Russian consulting firm and was sharing content in Russian that was supportive of Russia and critical of Ukraine and the West.

The rest are comparatively small...

Google Admits Its Internal Search API Documentation Leaked?

A few days ago the big story was "An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them", describing a set of documents that appear to be internal API documentation for Google Search and that show many aspects of how Search works differ from what Google has publicly claimed.

The author brought in people from the industry to help analyze it (he says he used to work in SEO himself, but left the field six years ago):

Next, I needed help analyzing and deciphering the naming conventions and more technical aspects of the documentation. I’ve worked with APIs a bit, but it’s been 20 years since I wrote code and 6 years since I practiced SEO professionally.

Several of the author's friends at Google, as well as ex-Googlers, confirmed that the documents follow Google's internal documentation conventions, and the way the elements are laid out also looks very much like Google documentation.

I figured that was about it, and that the follow-up would just be lots of people digging through the documents to work out Google's SEO preferences and find where they contradict what Google has said.

But then there was an unexpected twist: Google had been sticking to "no comment," then suddenly acknowledged that the documents really are internal: "Google confirms the leaked Search documents are real".

"We would caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information," Google spokesperson Davis Thompson told The Verge in an email. "We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation."

Nobody is going to believe the "outdated" line anyway, but it's easy to predict that Google's search results will get worse again, since even more SEO spam will start gaming its way up the rankings...

Using udm=14 to Strip a Bunch of Extra Features from Google Search

在「&udm=14 | the disenshittification Konami code」這邊看到的,裡面有提到「How I Made Google’s “Web” View My Default Search」這篇說明。

I tested the query "how many people in the earth", and this is what it looks like (w/ uBlock Origin):

加上 &udm=14 後會拿掉很多「輔助」的區塊:

Some people may prefer the former and some the latter; pick whichever you like... (though I'll probably keep using Kagi)

Porkbun's Domain Registration Service

The thread "Tell HN: I think there are major issues with Google –> Squarespace domains" describes problems that surfaced after Google sold its domain registration business to Squarespace, and a lot of commenters below brought up the registrar Porkbun.

Reading the comments, the main drawback is that it doesn't support every TLD, so some domains may not be transferable to Porkbun, but most of the common ones are supported.

In Taiwan... since Gandi has a local company that can issue invoices directly for expensing, it's quite a bit more convenient for many Taiwanese companies.

Noting it down for now...

Chrome 124 Enables X25519Kyber768

Saw this in "Google Chrome's new post-quantum cryptography may break TLS connections", based on the announcement in "Q1 2024 Summary from Chrome Security".

Chrome 124 enables X25519Kyber768 by default:

After several months of experimentation for compatibility and performance impacts, we’re launching a hybrid postquantum TLS key exchange to desktop platforms in Chrome 124. This protects users’ traffic from so-called “store now decrypt later” attacks, in which a future quantum computer could decrypt encrypted traffic recorded today.

If you run into problems, you can turn off #enable-tls13-kyber in chrome://flags and restart the browser... but it should be mostly fine: most HTTPS servers presumably haven't implemented X25519Kyber768 yet, and even Cloudflare probably has it off by default.

Reddit's Early Custom Work for Google's Search Engine

Saw a discussion on Hacker News that happens to include comments from Jeremy Edberg, Reddit's first engineering hire: "Reddit is taking over Google (businessinsider.com)".

The comment at id=40068381 covers quite a few things; the first is the practice of putting the post title into the URL:

To this day, my most public contribution to reddit is that I wrote the code to put the title of the post in the URL. That was done specifically for SEO purposes.
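Not Reddit's actual code, but a minimal sketch of the idea: derive a URL-safe slug from the post title so the keywords show up in the URL itself:

import re

def slugify(title):
    # lowercase, collapse anything non-alphanumeric into underscores,
    # then trim so the slug stays reasonably short
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    return slug.strip("_")[:50]

print(slugify("Reddit is taking over Google"))  # reddit_is_taking_over_google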

Google also handled Reddit specially in Google Webmasters (now called Google Search Console): the crawl rate was forced to "Custom" and Reddit's people couldn't change it:

It was pretty much the only SEO optimization we ever did (along with a few DOM changes), because shortly after that, Google basically dedicated engineering effort specifically to crawling reddit. So much so that we lost the "crawl rate" button in our SEO admin page on Google, it was just set to "Custom".

Later they also had to split Google's crawl traffic off at the load balancer, because the crawling pattern is very different from normal usage and would make caching extremely inefficient:

I had to stand up a fleet of app servers and caches and databases, and change the load balancers so that Google basically had their own infrastructure (although we would shunt all crawlers there). Crawler traffic was very different than regular traffic -- it looked at pages more than two days old, something humans rarely did at the time. It would blow out every cache (memory, database, disk, etc.). So we put them on their own infra to keep them from killing the rest of the site.

Pretty interesting war stories?

Google Releases Jpegli: a JPEG Encoder and Decoder

Saw "Jpegli: A new JPEG coding library (googleblog.com)" on Hacker News; the original post is "Introducing Jpegli: A New JPEG Coding Library".

The post does mention that file sizes can be smaller:

Our results show that Jpegli can compress high quality images 35% more than traditional JPEG codecs.

But it doesn't say which "traditional JPEG codec" the compression figure is compared against; this is probably the same old trick as in "WebP file sizes aren't necessarily smaller than JPEG...": most likely the comparison is against cjpeg, and against MozJPEG the gain probably wouldn't be that large.
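If you want to poke at the comparison yourself, something along these lines would do; note this is a hedged sketch, the cjpegli/cjpeg flags and the PNG input are assumptions about the locally installed tools, and a fair comparison really needs quality matching (e.g. via butteraugli or SSIMULACRA2 scores), not just the same nominal quality setting:

import os
import subprocess

SRC = "sample.png"  # hypothetical input image

# encode with jpegli's CLI encoder (assumes cjpegli from libjxl is installed)
subprocess.run(["cjpegli", SRC, "jpegli.jpg", "-q", "90"], check=True)

# encode with a traditional encoder (assumes mozjpeg's or libjpeg-turbo's cjpeg)
subprocess.run(["cjpeg", "-quality", "90", "-outfile", "classic.jpg", SRC], check=True)

for path in ("jpegli.jpg", "classic.jpg"):
    print(path, os.path.getsize(path), "bytes")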

What I wanted to note down is the naming logic mentioned in the Hacker News comments: the -li suffix gives it away as a product of Google's Zürich team:

The suffix -li is used in Swiss German dialects. It forms a diminutive of the root word, by adding -li to the end of the root word to convey the smallness of the object and to convey a sense of intimacy or endearment.

This obviously comes out of Google Zürich.

You can see similar naming in other products; the better-known ones are Zopfli and Brotli.

Google's HyperLogLog++

This follows up on yesterday's "Redis's space-saving HyperLogLog implementation": Redis's HyperLogLog implementation references Google's 2013 paper "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", which introduced HyperLogLog++ (HLL++).

Google proposes three main improvements in the paper. The first is using a 64-bit hash function:

5.1 Using a 64 Bit Hash Function

The reason is also given: once you need to handle more than a billion items, the roughly 4-billion (2^32) ceiling of a 32-bit hash starts to run out:

To fulfill the requirement of being able to estimate multisets of cardinalities beyond 1 billion, we use a 64 bit hash function.

This also lets them drop the original paper's correction for count estimates when the cardinality approaches 2^32, since in practice you're unlikely to ever run into the 2^64 ceiling?

It would be possible to introduce a similar correction if the cardinality approaches 2^64, but it seems unlikely that such cardinalities are encountered in practice.
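A rough sketch of the mechanics (mine, not the paper's code): with p index bits, each register stores the position of the first 1-bit in the remaining 64 − p bits of the hash, so moving to a 64-bit hash is what removes the ~4-billion ceiling:

import hashlib

P = 14          # 2^14 = 16384 registers, a typical precision setting
M = 1 << P

def add(value, registers):
    # take the first 64 bits of a cryptographic hash as the 64-bit hash value
    h = int.from_bytes(hashlib.sha1(value).digest()[:8], "big")
    idx = h >> (64 - P)                      # top p bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - p bits
    rho = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
    registers[idx] = max(registers[idx], rho)

registers = [0] * M
add(b"some element", registers)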

The second is an improved count-estimation formula for low cardinalities:

5.2 Estimating Small Cardinalities
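For reference, the small-range estimator carried over from the original algorithm is LinearCounting: with m registers of which V are still zero, the estimate is m·ln(m/V); HLL++ additionally applies an empirically derived bias correction in the mid range (my summary of the paper, not a quote).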

The third is the idea of a sparse representation: at low cardinalities there's no need to store the full HLL data structure directly:

5.3 Sparse Representation
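A hedged sketch of the idea (the paper's actual sparse encoding also uses a higher precision p′ and a compressed integer list; this only shows the switch-over logic, and the threshold below is an assumption, not the paper's rule):

P = 14
M = 1 << P
SPARSE_LIMIT = 4096   # assumed switch-over point, not taken from the paper

class SparseHLL:
    def __init__(self):
        self.sparse = {}      # register index -> rho, used only while cardinality is low
        self.dense = None     # full 2^p register array, allocated lazily

    def add(self, idx, rho):
        if self.dense is None:
            self.sparse[idx] = max(self.sparse.get(idx, 0), rho)
            if len(self.sparse) > SPARSE_LIMIT:
                # the sparse form stopped saving space: convert to dense registers
                self.dense = [0] * M
                for i, r in self.sparse.items():
                    self.dense[i] = r
                self.sparse = None
        else:
            self.dense[idx] = max(self.dense[idx], rho)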

Also, from reading back through the references, I had to look things up to understand how the (ε, δ)-model is used; page 17 of the slides "A Quick Introduction to Data Stream Algorithmics" explains that it means: with what probability (the randomized part) the answer falls within an ε error bound (the approximation part), as in the example on page 18:

Example: Actual answer is within 5 ± 1 with prob ≥ 0.9
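Written out, the guarantee is that the randomized estimate Â of the true answer A satisfies Pr[|Â − A| ≤ ε·A] ≥ 1 − δ; in the slide's example A = 5, the allowed error ε·A is 1 (so ε = 0.2) and δ = 0.1.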