Linux 6.2 的 Btrfs 改進

Hacker News 上看到 Btrfs 的改善消息:「Btrfs With Linux 6.2 Bringing Performance Improvements, Better RAID 5/6 Reliability」,對應的討論在「 Btrfs in Linux 6.2 brings performance improvements, better RAID 5/6 reliability (phoronix.com)」這邊。

因為 ext4 本身很成熟了,加上特殊的需求反而會去用 OpenZFS,就很久沒關注 Btrfs 了,這次看到 Btrfs 在 Linux 6.2 上的改進剛好可以重顧一下情況。

看起來是針對 RAID 模式下的改善,包括穩定性與效能,不過看起來是針對 RAID5 的部份多一點。

就目前的「情勢」看起來,Btrfs 之所以還是有繼續被發展,主要還是因為 OpenZFS 的授權條款是 CDDL,與 Linux kernel 用的 GPLv2 不相容,所以得分開維護。

但 OpenZFS 這邊的功能性與成熟度還是比 Btrfs 好不少,以現階段來說,如果架構上可以設計放 OpenZFS 的話應該還是會放 OpenZFS...

Linux 無線網路的 RCE 洞

Hacker News 首頁上看到 Linux 無線網路的 RCE 漏洞:「Some remotely exploitable kernel WiFi vulnerabilities」,mailing list 的信件是這邊:「[oss-security] Various Linux Kernel WLAN security issues (RCE/DOS) found」。

裡面題到了五個漏洞,其中屬於 RCE 的是這三個:

  • CVE-2022-41674: fix u8 overflow in cfg80211_update_notlisted_nontrans (max 256 byte overwrite) (RCE)
  • CVE-2022-42719: wifi: mac80211: fix MBSSID parsing use-after-free use after free condition (RCE)
  • CVE-2022-42720: wifi: cfg80211: fix BSS refcounting bugs ref counting use-after-free possibilities (RCE)

第一個只寫「An issue was discovered in the Linux kernel through 5.19.11.」,但討論上看到說應該是 5.1+,第二個在 CVE 裡面有提到是 5.2+,第三個是 5.1+。然後已經有看到 PoC code 了...

對於用 Linux 筆電的人得等各家 distribution 緊急出更新;但有些無線網路設備不知道怎麼辦...

現在的 vm.swappiness

查了一下現在的 vm.swappiness,發現跟以前又有一些差異。在「Documentation for /proc/sys/vm/」這邊可以看到說明。

很久以前應該是 0100,現在變成 0200 了,其中設定成 100 會在比重公式裡讓 memory 與 swap 的計算上有相同的比重:

This control is used to define the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200. At 100, the VM assumes equal IO cost and will thus apply memory pressure to the page cache and swap-backed pages equally; lower values signify more expensive swap IO, higher values indicates cheaper.

另外是設定 0 時的方式,在不夠用的時候還是會去用:

At 0, the kernel will not initiate swap until the amount of free and file-backed pages is less than the high watermark in a zone.

目前看起來之前建議設成 1 的方式應該是還 OK...

NVIDIA 開源 Linux GPU Kernel Driver

NVIDIA 宣佈開源 Linux 下的 GPU Kernel Driver:「NVIDIA Releases Open-Source GPU Kernel Modules」。

從一些描述上可以看出來,應該是因為 Datacenter 端的動力推動的,所以這次 open source 的版本中,對 Datacenter GPU 的支援是 production level,但對 GeForce GPU 與 Workstation GPU 的支援直接掛 alpha level:

Which GPUs are supported by Open GPU Kernel Modules?

Open kernel modules support all Ampere and Turing GPUs. Datacenter GPUs are supported for production, and support for GeForce and Workstation GPUs is alpha quality. Please refer to the Datacenter, NVIDIA RTX, and GeForce product tables for more details (Turing and above have compute capability of 7.5 or greater).

然後 user-mode driver 還是 closed source:

Will the source for user-mode drivers such as CUDA be published?

These changes are for the kernel modules; while the user-mode components are untouched. So the user-mode will remain closed source and published with pre-built binaries in the driver and the CUDA toolkit.

nouveau 來說,是可以從 open source driver 裡面挖一些東西出來用,不過能挖到跟 proprietary 同樣效能水準嗎?

Linux 打算合併 /dev/random 與 /dev/urandom 遇到的問題

Hacker News 上看到「Problems emerge for a unified /dev/*random (lwn.net)」的,原文是「Problems emerge for a unified /dev/*random」(付費內容,但是可以透過 Hacker News 上的連結直接看)。

標題提到的兩個 device 的性質會需要一些背景知識,可以參考維基百科上面「/dev/random」這篇的說明,兩個都是 CSPRNG,主要的分別在於 /dev/urandom 通常不會 block:

The /dev/urandom device typically was never a blocking device, even if the pseudorandom number generator seed was not fully initialized with entropy since boot.

/dev/random 不保證不會 block,有可能會因為 entropy 不夠而卡住:

/dev/random typically blocked if there was less entropy available than requested; more recently (see below, different OS's differ) it usually blocks at startup until sufficient entropy has been gathered, then unblocks permanently.

然後順便講一下,因為這是 crypto 相關的設計修改,加上是 kernel level 的界面,安全性以及相容性都會是很在意的點,而 Hacker News 上的討論裡面很多是不太在意這些的,你會看到很多「很有趣」的想法在上面討論 XDDD

回到原來的文章,Jason A. Donenfeld (Linux kernel 裡 RNG maintainer 之一,不過近期比較知名的事情還是 WireGuard 的發明人) 最近不斷的在改善 Linux kernel 裡面這塊架構,這次打算直接拿 /dev/random 換掉 /dev/urandom:「Uniting the Linux random-number devices」。

不過換完後 Google 的 Guenter Roeck 就在抱怨在 QEMU 環境裡面炸掉了:

This patch (or a later version of it) made it into mainline and causes a large number of qemu boot test failures for various architectures (arm, m68k, microblaze, sparc32, xtensa are the ones I observed). Common denominator is that boot hangs at "Saving random seed:". A sample bisect log is attached. Reverting this patch fixes the problem.

他透過 git bisect 找到發生問題的 commit,另外從卡住的訊息也可以大概猜到在虛擬機下 entropy 不太夠。

另外從他們三個 (加上 Linus) 在 mailing list 上面討論的訊息可以看到不少交流:「Re: [PATCH v1] random: block in /dev/urandom」,包括嘗試「餵」entropy 進 /dev/urandom 的 code...

後續看起來還會有一些嘗試,但短期內看起來應該還是會先分開...

最近 Linux 核心安全性問題的 Dirty Pipe 故事很有趣...

Hacker News 上看到「The Dirty Pipe Vulnerability」這個 Linux kernel 的安全性問題,Hacker News 上相關的討論在「The Dirty Pipe Vulnerability (cm4all.com)」這邊可以看到。

這次出包的是 splice() 的問題,先講他寫出可重製 bug 的程式碼,首先是第一個程式用 user1 放著跑:

#include <unistd.h>
int main(int argc, char **argv) {
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

然後第二個程式也放著跑 (可以是不同的 user2,完全無法碰到 user1 的權限):

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
  for (;;) {
    splice(0, 0, 1, 0, 2, 0);
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null

理論上不會在 foo 裡面看到任何 BBBBB 的字串,但卻打穿了... 透過 git bisect 的檢查,他也確認了是在「pipe: merge anon_pipe_buf*_ops」這個 commit 時出的問題。

不過找到問題的過程拉的頗長,一開始是有 web hosting 服務的 support ticket 說 access log 下載下來發現爛掉了,無法解壓縮:

It all started a year ago with a support ticket about corrupt files. A customer complained that the access logs they downloaded could not be decompressed. And indeed, there was a corrupt log file on one of the log servers; it could be decompressed, but gzip reported a CRC error.

然後他先手動處理就把票關起來了:

I fixed the file’s CRC manually, closed the ticket, and soon forgot about the problem.

接下來過幾個月後又發生,經過幾次的 support ticket 後他手上就有一些「資料」可以看:

Months later, this happened again and yet again. Every time, the file’s contents looked correct, only the CRC at the end of the file was wrong. Now, with several corrupt files, I was able to dig deeper and found a surprising kind of corruption. A pattern emerged.

然後因為發生的頻率也不是很高,加上邏輯上卡到死胡同,所以他也沒有辦法花太多時間在上面:

None of this made sense, but new support tickets kept coming in (at a very slow rate). There was some systematic problem, but I just couldn’t get a grip on it. That gave me a lot of frustration, but I was busy with other tasks, and I kept pushing this file corruption problem to the back of my queue.

後來真的花時間下去找,利用先前的 pattern 掃了一次系統 log,發現有規律在:

External pressure brought this problem back into my consciousness. I scanned the whole hard disk for corrupt files (which took two days), hoping for more patterns to emerge. And indeed, there was a pattern:

  • there were 37 corrupt files within the past 3 months
  • they occurred on 22 unique days
  • 18 of those days have 1 corruption
  • 1 day has 2 corruptions (2021-11-21)
  • 1 day has 7 corruptions (2021-11-30)
  • 1 day has 6 corruptions (2021-12-31)
  • 1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one which most corruptions occur.

然後就試著寫各種 reproducible code,最後成功的版本就是開頭提到的,然後他發現這個漏洞可以是 security vulnerability,就回報出去了,可以看到前後從第一次的 support ticket 到最後解決花了快一年的時間,不過 Linux kernel 端修正的速度蠻快的:

  • 2021-04-29: first support ticket about file corruption
  • 2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability
  • 2022-02-20: bug report, exploit and patch sent to the Linux kernel security team
  • 2022-02-21: bug reproduced on Google Pixel 6; bug report sent to the Android Security Team
  • 2022-02-21: patch sent to LKML (without vulnerability details) as suggested by Linus Torvalds, Willy Tarreau and Al Viro
  • 2022-02-23: Linux stable releases with my bug fix (5.16.11, 5.15.25, 5.10.102)
  • 2022-02-24: Google merges my bug fix into the Android kernel
  • 2022-02-28: notified the linux-distros mailing list
  • 2022-03-07: public disclosure

整個故事還蠻精彩的 XD

Ingo Molnár 提出讓 Linux Kernel 編譯速度提昇的 mega patch

Hacker News 首頁上看到「Massive ~2.3k Patch Series Would Improve Linux Build Times 50~80% & Fix "Dependency Hell"」這個,對應到 mailing list 上的信件是「* [PATCH 0000/2297] [ANNOUNCE, RFC] "Fast Kernel Headers" Tree -v1: Eliminate the Linux kernel's "Dependency Hell"」這個,看到「0000/2297」這個 prefix XDDD

他主要是想要改善 Linux Kernel 的 compile 時間 (從 project 的名稱「Fast Kernel Headers」可以看到),只是沒想到會縮短這麼多。另外一方面也順便處理了 dependency hell 的問題 (改善維護性)。

測試出來的結果相當驚人,從 231.34 +- 0.60 secs (15.5 builds/hour) 到 129.97 +- 0.51 secs (27.7 builds/hour),以編譯次數來看的話是 78% 的改善。如果以 CPU time 來看的話,從 11,474,982.05 msec cpu-clock 降到 7,100,730.37 msec cpu-clock,也是以編譯次數來算的話,有 61.6% 的改善...

這是花了一年多的時間嘗試才達成的目標,嘗試不同的方法,前幾次雖然都有改善,但改善幅度太小,變動卻太大,他覺得不值得丟出來,直到第三次才達成這樣的目標...

第一次:

When I started this project, late 2020, I expected there to be maybe 50-100 patches. I did a few crude measurements that suggested that about 20% build speed improvement could be gained by reducing header dependencies, without having a substantial runtime effect on the kernel. Seemed substantial enough to justify 50-100 commits.

第二次:

But as the number of patches increased, I saw only limited performance increases. By mid-2021 I got to over 500 commits in this tree and had to throw away my second attempt (!), the first two approaches simply didn't scale, weren't maintainable and barely offered a 4% build speedup, not worth the churn of 500 patches and not worth even announcing.

第三次:

With the third attempt I introduced the per_task() machinery which brought the necessary flexibility to reduce dependencies drastically, and it was a type-clean approach that improved maintainability. But even at 1,000 commits I barely got to a 10% build speed improvement. Again this was not something I felt comfortable pushing upstream, or even announcing. :-/

然後基於第三次的成果覺得有望,意外的發現後續的速度改善比想像中的多非常多:

But the numbers were pretty clear: 20% performance gains were very much possible. So I kept developing this tree, and most of the speedups started arriving after over 1,500 commits, in the fall of 2021. I was very surprised when it went beyond 20% speedup and more, then arrived at the current 78% with my reference config. There's a clear super-linear improvement property of kernel build overhead, once the number of dependencies is reduced to the bare minimum.

這次的 patch 雖然超大包,但看起來對於 compile 時間改善非常多,應該會有不少討論... 消息還蠻新的 (台灣時間今天早上五點的信),晚點可以看一下其他大老出來回什麼...

Linux Kernel 裡的 RNG 從 SHA-1 換成 BLAKE2s

Hacker News Daily 上看到的消息,Linux Kernel 裡的 RNG,裡面用到的 SHA-1 演算法換成 BLAKE2s 了:

SHA-1 已知的問題是個隱患,不過換成 BLAKE2s 應該是 maintainer 的偏好,Jason Donenfeld 在 WireGuard 裡面也是用 BLAKE2s...

SCTP over UDP

Red Hat 的 blog 上看到 SCTP over UDP 的方式,讓不支援 SCTP 的 NAT 裝置也能使用 SCTP:「An easier way to go: SCTP over UDP in the Linux kernel」。

這個方式也是個標準,在 RFC 6951:「UDP Encapsulation of Stream Control Transmission Protocol (SCTP) Packets for End-Host to End-Host Communication」。

Linux kernel 則是在 5.11 之後才支援,不過翻了一下 5.11 是一般版,在 2021 年二月釋出,在五月底就已經 EoL:

Stream Control Transmission Protocol over User Datagram Protocol (SCTP over UDP, also known as UDP encapsulation of SCTP) is a feature defined in RFC6951 and implemented in the Linux kernel space since 5.11.0.

上一個 LTS 版本是 5.10,下一個不知道是什麼時候了...

Linux Kernel 與明尼蘇達大學之間的攻防

Linux Kernel Community 與明尼蘇達大學 (UMN) 之間的事件差不多告一段落了,整理一下裡面比較重要的事件。隔壁棚 Basecamp 的事情還在燒,讓子彈多飛一點時間,等該跑出來的內部資訊都跑出來以後再來整理...

Linux Kernel 這件事情各家媒體都有整理出來,這邊拉 ZDNet 的文章來看:

講一下我的感想,因為 UMN 可以從這次事件證明了 Linux Kernel Community 沒有足夠的能力抵禦這類惡意攻擊,而且 Linux Kernel Community 也沒有打算解決這件事情,如果要比喻的話,很像台灣常看到的「解決發現問題的人」。

只要流程沒有改善,幾乎可以預測出之後會有政府資助的方式塞 buggy patch 進去埋洞。

作為 Linux 作業系統使用者,看起來沒什麼可以改變的,只能從架構面上設計出來安全界線,讓被攻進來時有一些防線防止直接打穿到底...