2012 年在 Google 時強制使用統一標準的 BUILD 檔案

Hacker News 上看到「Reformatting 100k Files at Google in 2011 (le-brun.eu)」這篇,原文在「The Story of Reformatting 100k Files at Google in 2012」這邊。

這篇除了作者寫的東西以外,Russ Cox 也在 Hacker News 上面分享了一些當時的背景。

故事是 2012 年的時候,作者 Laurent Le Brun (LinkedIn) 在德國,剛加入 Google 沒多久,負責處理 Google 內部的編譯工具 Blaze (外部稱 Bazel):

Back in September 2012, I was a junior engineer at Google, working on Bazel (Google’s build tool, also known internally as Blaze).

然後他與他的 team lead 收到美國團隊兩位工程師的會議邀請:

One day, a mysterious calendar invite landed in my inbox. It was sent by two engineers in the US, and I was invited along with my team lead.

然後這兩位是 Rob Pike 以及 Russ Cox:

I quickly recognized the names: Rob Pike and Russ Cox. Though I hadn't worked with them, I knew them by reputation: Russ Cox because I enjoyed reading his blog posts, and Rob Pike because, well… he’s famous.

作者這邊也懶得介紹 Rob Pike 了,畢竟是當年 Bell LabsPlan 9 的頭頭,然後又是 UTF-8 的發明人,後來在 Google 裡面也是 Go (程式語言) 的頭頭... 順便一提 Go 的吉祥物是他老婆畫的。

他們兩個人希望把全公司所有的 BUILD 檔案格式統一:

During the meeting, Rob and Russ shared their ambitious plan: to reformat every single Bazel BUILD file in Google’s codebase and enforce this formatting with a presubmit script.

中間有提到一些工程上的進行方式,但對我來說比較重要的是這幾個,首先是 Russ Cox 提到他在 Go 的團隊裡面學到的,(在大公司裡面) 不要讓工程師浪費時間在 formatting 之類的問題:

The formatting issue is a problem that should not exist and that we care enough about to spend our own engineering time eliminating it. I understand that it does not seem a priori that automated formatting would make a significant difference. I didn't truly understand it myself until we eliminated it from the Go code review and code editing processes. Hopefully in six months or a year, when everything is converted and we look back at this, the benefits will be clearer in retrospect.

另外在 update 的地方作者也提到了當時決定一次到位,把所有的 BUILD 檔案都修正完,而不是讓後續的人在更新時才強制要求,這樣可以避免後面的人接手的時候產生大量的 diff 而導致 review 困難:

Why not enable the new formatter without updating all the files? This would lead to long diffs whenever someone modifies a file, making the change hard to review. The cost would be spread over all the engineers in the company.

Ken Thompson 的密碼

剛剛看到這串還蠻歡樂的...

起因於 BSD 3 的程式碼裡面有個 /etc/passwd,而且是帶有 crypt 的版本:「unix-history-repo/etc/passwd」。

裡面有蠻多密碼都已經被解出來了,但還是有些還沒解出來... 而最近的消息是 ken (Ken Thompson) 的密碼被解了出來:「Ken Thompson's Unix password」。

From: Nigel Williams <nw@retrocomputingtasmania.com>
Cc: TUHS main list <tuhs@minnie.tuhs.org>
Subject: Re: [TUHS] Recovered /etc/passwd files
Date: Wed, 9 Oct 2019 16:49:48 +1100

ken is done:

ZghOT0eRm4U9s:p/q2-q4!

took 4+ days on an AMD Radeon Vega64 running hashcat at about 930MH/s
during that time (those familiar know the hash-rate fluctuates and
slows down towards the end).

另外解出來的人也發現了這組密碼是一組西洋棋的 Descriptive notation,跟 Ken Thompson 的背景也相符:

From: Nigel Williams <nw@retrocomputingtasmania.com>
Cc: TUHS main list <tuhs@minnie.tuhs.org>
Subject: Re: [TUHS] Recovered /etc/passwd files
Date: Wed, 9 Oct 2019 16:52:00 +1100

On Wed, Oct 9, 2019 at 4:49 PM Nigel Williams
<nw@retrocomputingtasmania.com> wrote:
> ZghOT0eRm4U9s:p/q2-q4!

BTW, is that a chess move?

不過我覺得最好玩的是這個,不確定是不是本尊就是了:

From: Ken Thompson via TUHS <tuhs@minnie.tuhs.org>
To: Andy Kosela <akosela@andykosela.com>
Cc: TUHS main list <tuhs@minnie.tuhs.org>
Subject: Re: [TUHS] Recovered /etc/passwd files
Date: Wed, 9 Oct 2019 01:53:25 -0700

congrats.

On Wed, Oct 9, 2019 at 1:16 AM Andy Kosela <akosela@andykosela.com> wrote:
>
> On 10/9/19, Warner Losh <imp@bsdimp.com> wrote:
> > On Tue, Oct 8, 2019, 11:52 PM Nigel Williams
> > <nw@retrocomputingtasmania.com>
> > wrote:
> >
> >> On Wed, Oct 9, 2019 at 4:49 PM Nigel Williams
> >> <nw@retrocomputingtasmania.com> wrote:
> >> > ZghOT0eRm4U9s:p/q2-q4!
> >>
> >> BTW, is that a chess move?
> >>
> >
> > Most common opening.
> >
>
> Descriptive chess notation is not as popular today as it was back in
> the 70s, but it actually makes perfect sense as Ken is a long time
> chess enthusiast.
>
> --Andy

還有 Rob Pike 對這件事情不怎麼贊同的看法:

From: Rob Pike <robpike@gmail.com>
To: Nigel Williams <nw@retrocomputingtasmania.com>
Cc: TUHS main list <tuhs@minnie.tuhs.org>
Subject: Re: [TUHS] Recovered /etc/passwd files
Date: Wed, 9 Oct 2019 09:59:43 -1000

I coulda told you that. One tends to learn passwords (inadvertently) when
they're short and typed nearby often enough. (Sorry, ken.)

If I remember right, the first half of this password was on a t-shirt
commemorating Belle's first half-move, although its notation may have been
different.

Interesting though it is, though, I find this hacking distasteful. It was
distasteful back when, and it still is. The attitudes around hackery have
changed; the position nowadays seems to be that the bad guys are doing it
so the good guys should be rewarded for doing it first. That's disingenuous
at best, and dangerous at worst.

-rob


On Tue, Oct 8, 2019 at 7:50 PM Nigel Williams <nw@retrocomputingtasmania.com>
wrote:

> ken is done:
>
> ZghOT0eRm4U9s:p/q2-q4!
>
> took 4+ days on an AMD Radeon Vega64 running hashcat at about 930MH/s
> during that time (those familiar know the hash-rate fluctuates and
> slows down towards the end).
>

意外的引誘到一群人跑出來...

非常經典的 UTF-8...

Hacker News 文摘上看到「UTF-8 – “The most elegant hack”」這篇。除了維基百科上的資料以外,Rob Pike 與其他人在 2003 年寫的 mail 也是相當重要的資料。

Ken Thompson 與 Rob Pike 兩位發展出來的 UTF-8 被譽為最優雅的 hack 真的一點都不為過。Unicode 1.0 在 1991 年 10 月公佈。之後就陸陸續續有表示的格式出來...

相容於 ASCII 0-127 的 UTF-1 在 1992 年被提出來,但 parsing performance 並不好。

1992 年 7 月,Dave Prosser 提出 FSS-UTF,很類似後來的 UTF-8 但缺乏 self-synchronizing 特性 (這個特性指的是,從字串中間可以很容易找到切割點)。

1992 年 8 月,Ken Thompson 改善了 FSS-UTF,讓 bit 使用效率低一點,但因此擁有 self-synchronizing 特性。之後在 1992 年 9 月,Rob Pike 與 Ken Thompson 將 UTF-8 實做到 Plan 9 上。而 UTF-8 正式公開發表則是在 1993 年 1 月的 USENIX 上。

UTF-8 的設計看起來很 hack,但卻有這些優美的特性:

與既有系統的相容性

只包含 ASCII 0-127 的字串是合法的 UTF-8 字串。

重點是 0 被保留下來,也就是本來的 NULL-terminated 字串處理全部都可以沿用,這使得從 C 語言的 strcpy(),到一堆網路上已經跑很久的通訊協定,都可以繼續沿用。

極高的辨識性

UTF-8 很容易被判斷出來,引用維基百科的數字:

The probability of a random string of bytes which is not pure ASCII being valid UTF-8 is 3.9% for a two-byte sequence, and decreases exponentially for longer sequences.

非 ASCII 字串只要稍微有長度 (四個中文字,12 bytes?),判斷字串是否為 UTF-8 的正確性應該跟各種服務的 SLA 有得拼...

與 Unicode 的順序對應相容

Unicode 的編號順序與 UTF-8 相容,也就是說連傳統的 strcmp() 都可以直接拿來用:

Sorting a set of UTF-8 encoded strings as strings of unsigned bytes yields the same order as sorting the corresponding Unicode strings lexicographically by codepoint.

避開 UTF-16 的 BOM

BOM 的 0xFE 與 0xFF 在合法 UTF-8 文件裡是看不到的,所以如果開頭有看到 BOM 時一定不是 UTF-8:

The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it.

self-synchronizing 特性

由於 encoding 的特性,UTF-8 字串要找下一個斷點是很容易的:

找到符合這六種開頭的 string pattern 就是斷點。也因為如此,容錯率相當高。

可以容納所有 Unicode 字元

也因為 encoding 特性,UTF-8 理論值可以容納百萬個字元 (依照 RFC3629 的額外限制,是 1112064 個)。在還沒有找到很多外星文明之前,應該都還夠用。(2012 年發佈的 Unicode 6.2 也才十一萬個字元,110182 個字元)

Unicode 與 UTF-8 之間的轉換很方便

再次因為 encoding 特性,轉換幾乎是 bit 運算就可以操作完畢。(注意 Last code point 的值都切齊)

因為太多好處,變成超級標準了...

這是一個幾乎找不到缺點的 standard,所以早期很多 programmer 選擇的原因是「看了就喜歡」,於是就有大量的 library。接下來有大量的 standard (這還包括 XML standard) 直接挑明講 UTF-8 的處理能力是必要條件。

總結...

UTF-8 encoding 怎麼看都很 hack (看起來很隨意的把不同 Unicode 區段切割到不定寬度字集內,感覺不到特別的處理),但卻很完美的解決了「如果可以處理 8bits 時,要與現有系統相容」的問題。也因為這個 encoding 把問題解決得很乾淨 (UTF-8 解決不了的,其他人都解不乾淨),於是就變成超級主流 encodoing...

Google 在 2012 年 2 月時就寫過一篇「Unicode over 60 percent of the web」,這還是扣掉 ASCII 的 20%!

現在是 2013 年快年尾了,可以預期之後是 UTF-8 萬萬歲了...

如果想要了解更細,可以參考維基百科的「Comparison of Unicode encodings」,裡面有與其他 Unicode 格式的比較。