原來有專有名詞:TOCTOU (Time-of-check to time-of-use)

看「The trouble with symbolic links」這篇的時候看到的專有名詞:「TOCTOU (Time-of-check to time-of-use)」,直翻是「先檢查再使用」,算是一個常見的 security (hole) pattern,因為檢查完後有可能被其他人改變,接著使用的時候就有可能產生安全漏洞。

在資料庫這類環境下,有 isolation (ACID 裡的 I) 可以確保不會發生這類問題 (需要 REPEATABLE-READ 或是更高的 isolation level)。

但在檔案系統裡面看起來不太順利,2004 年的時候研究出來沒有 portable 的方式可以確保避免 TOCTOU 的問題發生:

In the context of file system TOCTOU race conditions, the fundamental challenge is ensuring that the file system cannot be changed between two system calls. In 2004, an impossibility result was published, showing that there was no portable, deterministic technique for avoiding TOCTOU race conditions.

其中一種 mitigation 是針對 fd 監控:

Since this impossibility result, libraries for tracking file descriptors and ensuring correctness have been proposed by researchers.

然後另外一種方式 (比較治本) 是檔案系統的 API 支援 transaction,但看起來不被主流接受?

An alternative solution proposed in the research community is for UNIX systems to adopt transactions in the file system or the OS kernel. Transactions provide a concurrency control abstraction for the OS, and can be used to prevent TOCTOU races. While no production UNIX kernel has yet adopted transactions, proof-of-concept research prototypes have been developed for Linux, including the Valor file system and the TxOS kernel. Microsoft Windows has added transactions to its NTFS file system, but Microsoft discourages their use, and has indicated that they may be removed in a future version of Windows.

目前看起來的問題是沒有一個讓 Linux community 能接受的 API 設計?

Firefox 的 RCWN (Race Cache With Network)

前幾天 Hacker News 上看到「When network is faster than browser cache (2020) (simonhearne.com)」這則 2020 的文章,原文在「When Network is Faster than Cache」這邊,講 Firefox 在 2017 年導入了一個特別的設計,除了會在 cache 裡面抓資料以外,也會到網路上拉看看,有機會從網路上抓到的資料會比 cache 先得到,這個功能叫做 RCWN (Race Cache With Network):「Enable RCWN」。

開頭就先提到了有人回報 Firefox 上的 RCWN 似乎沒有明顯效果:「Tune RCWN racing parameters (and make them pref-able)」。

On my OSX box I'm seeing us race more than we probably need to:

Total network request count: 5574
Cache won count 938
Net won count 13

That's racing almost 16% of the time, but only winning 1.3% of the time. We should probably back off on racing a bit in this case, at least.

16% 的 request 決定 RCWN 兩邊打,但裡面只有 1.3% 是 network 比 cache 快。

不過作者決定試著再多找看看有沒有什麼方向可以確認,但測了很多項目都找不到哪個因素跟 cache retrieval time 有直接相關,反而在看看 Chromium 時發現 Chromium 是透過限制連線數量,降低 I/O 造成的問題:

It turns out that Chrome actively throttles requests, including those to cached resources, to reduce I/O contention. This generally improves performance, but will mean that pages with a large number of cached resources will see a slower retrieval time for each resource.

看起來就是個簡單粗暴的 workaround...

讀書時間:Meltdown 的攻擊方式

Meltdown 的論文可以在「Meltdown (PDF)」這邊看到。這個漏洞在 Intel 的 CPU 上影響最大,而在 AMD 是不受影響的。其他平台有零星的消息,不過不像 Intel 是這十五年來所有的 CPU 都中獎... (從 Pentium 4 以及之後的所有 CPU)

Meltdown 是基於這些前提,而達到記憶體任意位置的 memory dump:

  • 支援 µOP 方式的 out-of-order execution 以及當失敗時的 rollback 機制。
  • 因為 cache 機制造成的 side channel information leak。
  • 在 out-of-order execution 時對記憶體存取的 permission check 失效。

out-of-order execution 在大學時的計算機組織應該都會提到,不過我印象中當時只講「在確認不相干的指令才會有 out-of-order」。而現代 CPU 做的更深入,包括了兩個部份:

  • 第一個是 µOP 方式,將每個 assembly 拆成更細的 micro-operation,後面的 out-of-order execution 是對 µOP 做。
  • 第二個是可以先執行下去,如果發現搞錯了再 rollback。

像是下面的 access() 理論上不應該被執行到,但現代的 out-of-order execution 會讓 CPU 有機會先跑後面的指令,最後發現不該被執行到後,再將 register 與 memory 的資料 rollback 回來:

而 Meltdown 把後面不應該執行到 code 放上這段程式碼 (這是 Intel syntax assembly):

其中 mov al, byte [rcx] 應該要做記憶體檢查,確認使用者是否有權限存取那個位置。但這邊因為連記憶體檢查也拆成 µOP 平行跑,而產生 race condition:

Meltdown is some form of race condition between the fetch of a memory address and the corresponding permission check for this address.

而這導致後面這段不該被執行到的程式碼會先讀到資料放進 al register 裡。然後再去存取某個記憶體位置造成某塊記憶體位置被讀到 cache 裡。

造成 cache 內的資料改變後,就可以透過 FLUSH+RELOAD 技巧 (side channel) 而得知這段程式碼讀了哪一塊資料 (參考之前寫的「Meltdown 與 Spectre 都有用到的 FLUSH+RELOAD」),於是就能夠推出 al 的值...

而 Meltdown 在 mov al, byte [rcx] 這邊之所以可以成立,另外一個需要突破的地方是 [rcx]。這邊 [rcx] 存取時就算沒有權限檢查,在 virtual address 轉成 physical address 時應該會遇到問題?

原因是 LinuxOS X 上有 direct-physical map 的機制,會把整塊 physical memory 對應到 virtual memory 的固定位置上,這些位置不會再發給 user space 使用,所以是通的:

On Linux and OS X, this is done via a direct-physical map, i.e., the entire physical memory is directly mapped to a pre-defined virtual address (cf. Figure 2).

而在 Windows 上則是比較複雜,但大部分的 physical memory 都有對應到 kernel address space,而每個 process 裡面也都還是有完整的 kernel address space (只是受到權限控制),所以 Meltdown 的攻擊仍然有效:

Instead of a direct-physical map, Windows maintains a multiple so-called paged pools, non-paged pools, and the system cache. These pools are virtual memory regions in the kernel address space mapping physical pages to virtual addresses which are either required to remain in the memory (non-paged pool) or can be removed from the memory because a copy is already stored on the disk (paged pool). The system cache further contains mappings of all file-backed pages. Combined, these memory pools will typically map a large fraction of the physical memory into the kernel address space of every process.

這也是 workaround patch「Kernel page-table isolation」的原理 (看名字大概就知道方向了),藉由將 kernel 與 user 的區塊拆開來打掉 Meltdown 的攻擊途徑。

而 AMD 的硬體則是因為 mov al, byte [rcx] 這邊權限的檢查並沒有放進 out-of-order execution,所以就避開了 Meltdown 攻擊中很重要的一環。

無縫更換 symbolic link 所指的目錄或檔案

這邊如果把 atomatically 翻成原子性好像怪怪的,就照意思來翻好了。

這是一篇 2005 年的文章,講如何更換 symbolic link 內容,而且確保 symbolic link 不會短時間不見:「How to change symlinks atomically」。

作者拿了 strace 解釋 ln -snf 的例子,來說明這個方法沒辦法做到無縫:

$ strace ln -snf new current 2>&1 | grep link
unlink("current")         = 0
symlink("new", "current") = 0

unlink()symlink() 中間的 race condition 如果有人存取這個 symbolic link 就會失敗。作者提了這樣的方法來解決:

$ ln -s new current_tmp && mv -Tf current_tmp current

在「How does one atomically change a symlink to a directory in busybox?」這邊雖然提問的是 BusyBox,但道理相同,提到了怎麼做以及為什麼 (不要看綠色勾勾那個,看分數比較高的那個):

This can indeed be done atomically with rename(2), by first creating the new symlink under a temporary name and then cleanly overwriting the old symlink in one go.

DocumentRoot 是 symbolic link 時,這點變得很重要。這個方法才能避免切換目錄的過程中間不會有空檔,導致使用者收到 404...

另外通常會配合 mod_realdoc 一起用,避免程式用到 DocumentRoot 的路徑而導致前面指到的東西跟後面指到的東西不同。

英國計畫在 2018 年開始強制企業公佈男女的平均薪資及 Bonus

英國計畫從 2018 年開始,超過 250 人的公司必須公佈男女的平均薪資及 Bonus:「Companies will be forced to reveal their gender pay gap」:

The new rules, revealed on Friday, will apply to all companies with more than 250 employees.

除了平均薪資以及 bonus 外,還必須公開每個區間的人數:

In addition to publishing their average gender pay and bonus gap, around 8,000 employers across the country will also have to publish the number of men and women in each pay range.

目標是希望讓資訊更透明讓人力市場更健康:

The government is hoping that naming and shaming firms that pay women a lot less than men in the same jobs will push them to stop the practice, because it will make it harder for them to attract top talent.

可以看到目前估算出來的差異:

另外美國也在規劃類似的法案,不僅僅是性別,還包括了種族等其他資訊:

In the U.S., similar plans are also under discussions. President Obama announced a proposal earlier this month that would require companies with more than 100 employees to report how much they are paying their employees by race, ethnicity and gender.