Nvidia 在 Linux 上安裝核心驅動程式時將建議使用開源版本

在「NVIDIA to install open Linux kernel modules by default」這邊看到的新聞,引用的連結是官方的討論區「Unix graphics feature deprecation schedule」這篇。

從 560 開始會議建議使用開源版本:

Starting in the release 560 series, it will be recommended to use the open flavor of NVIDIA Linux Kernel Modules 204 wherever possible (Turing or later GPUs, or Ada or later when using GPU virtualization).

點進去看「Open Linux Kernel Modules」這頁可以看到開源版本有一些專屬功能 (在「The following features will only work with the open kernel modules flavor of the driver」這段),蛋也有一些功能是開源版本沒有的 (在「The following features are not yet supported by the open kernel modules」這段)。

另外 Known Issues 這邊有提到些效能與功耗上的差異。

看起來是 porting 的差不多了?我覺得可以再等一兩個版本 XD

Bcachefs 進入 Linux Kernel 6.7 主線了

Bcachefs 是 Linux 下一個新的 filesystem (但也發展了好幾年),剛剛看到進入 Linux Kernel 6.7 的主線了:「Bcachefs Merged Into The Linux 6.7 Kernel」。

看起來沒搭上 6.6 的列車 (前幾天出的,2023/10/30),但以目前 Linux Kernel 的步調來看,6.7 應該是兩個月後就會釋出,Ubuntu 有機會在明年的 24.04 內建...

從官網列出來的功能可以知道,Bcachefs 實作了很多現代 filesystem 會發展的功能,像是 compression、encryption 以及 snapshots,另外底層也實作了 checksum 與 copy on write。

這樣看起來,Bcachefs 目前在 Linux 上主要的競爭對象應該會是 OpenZFS。真正的比較應該會等到 6.7 的 rc 版本就會有人下去測,到時候再看看,甚至看看有沒有機會取代 ext4 變成預設的 filesystem。

Linux Kernel 後續的 LTS 版本將縮短成兩年

在「Long-term support for Linux kernel to be cut as maintainence remains under strain」這邊看到 Linux Kernel 後續的 LTS 版本將縮短成兩年的消息:

Here's one major change coming down the road: Long-term support (LTS) for Linux kernels is being reduced from six to two years.

主要的原因是舊版用的人並不多:

Why? Simple, Corbet explained: "There's really no point to maintaining it for that long because people are not using them." I agree. While I'm sure someone out there is still running 4.14 in a production Linux system, there can't be many of them.

而目前的 LTS kernel 還是會走完本來計畫的時間,4.14、4.19、5.4 以及 5.10 從表上看都是六年,5.15 是五年,最新的 LTS 6.1 則是四年。

降到兩年的話,代表各家 Linux distribution 在 LTS kernel 跑完生命週期後就得自己維護安全性更新了,或是直接升級到另外一個 kernel 版本 (後者的方法風險高一點,不確定系統的相容性)。

看起來 5.10 與 6.1 會跑很久了,都到 2026 年十二月...

ReiserFS 被標為 Obsolete

八月底的時候看到「ReiserFS Officially Declared "Obsolete"」這則新聞,這個有進到 Linux kernel 的是 Reiser3,不是後來有人接手但沒有進到 Linux kernel 裡的 Reiser4

在 5.18 的時候先標成 deprecated:「Linux 5.18 Moves Ahead With Deprecating ReiserFS」。

這次的 6.6 則是標成 Obsolete,逐步從 Linux kernel 裡面拔除:

As part of updates to the older file-system drivers for Linux 6.6, the ReiserFS file-system is no longer marked as "Supported" but is officially treated as "Obsolete" within the Linux kernel.

目前各大 Linux 套件的預設檔案系統應該都是 ext4,另外有些特殊情境下 XFS 也蠻好用的 (像是資料庫),對於追求極限性能的情境下比 ext4 快一些。

憑著印象,加上查了說明確認,ResierFS 應該是在小檔時會有優勢:

Compared with ext2 and ext3 in version 2.4 of the Linux kernel, when dealing with files under 4 KiB and with tail packing enabled, ReiserFS may be faster.

不過這是前 SSD 時代的產物了,但也沒有看到後續的比較了...

Linux 6.2 的 Btrfs 改進

Hacker News 上看到 Btrfs 的改善消息:「Btrfs With Linux 6.2 Bringing Performance Improvements, Better RAID 5/6 Reliability」,對應的討論在「 Btrfs in Linux 6.2 brings performance improvements, better RAID 5/6 reliability (phoronix.com)」這邊。

因為 ext4 本身很成熟了,加上特殊的需求反而會去用 OpenZFS,就很久沒關注 Btrfs 了,這次看到 Btrfs 在 Linux 6.2 上的改進剛好可以重顧一下情況。

看起來是針對 RAID 模式下的改善,包括穩定性與效能,不過看起來是針對 RAID5 的部份多一點。

就目前的「情勢」看起來,Btrfs 之所以還是有繼續被發展,主要還是因為 OpenZFS 的授權條款是 CDDL,與 Linux kernel 用的 GPLv2 不相容,所以得分開維護。

但 OpenZFS 這邊的功能性與成熟度還是比 Btrfs 好不少,以現階段來說,如果架構上可以設計放 OpenZFS 的話應該還是會放 OpenZFS...

Linux 無線網路的 RCE 洞

Hacker News 首頁上看到 Linux 無線網路的 RCE 漏洞:「Some remotely exploitable kernel WiFi vulnerabilities」,mailing list 的信件是這邊:「[oss-security] Various Linux Kernel WLAN security issues (RCE/DOS) found」。

裡面題到了五個漏洞,其中屬於 RCE 的是這三個:

  • CVE-2022-41674: fix u8 overflow in cfg80211_update_notlisted_nontrans (max 256 byte overwrite) (RCE)
  • CVE-2022-42719: wifi: mac80211: fix MBSSID parsing use-after-free use after free condition (RCE)
  • CVE-2022-42720: wifi: cfg80211: fix BSS refcounting bugs ref counting use-after-free possibilities (RCE)

第一個只寫「An issue was discovered in the Linux kernel through 5.19.11.」,但討論上看到說應該是 5.1+,第二個在 CVE 裡面有提到是 5.2+,第三個是 5.1+。然後已經有看到 PoC code 了...

對於用 Linux 筆電的人得等各家 distribution 緊急出更新;但有些無線網路設備不知道怎麼辦...

現在的 vm.swappiness

查了一下現在的 vm.swappiness,發現跟以前又有一些差異。在「Documentation for /proc/sys/vm/」這邊可以看到說明。

很久以前應該是 0100,現在變成 0200 了,其中設定成 100 會在比重公式裡讓 memory 與 swap 的計算上有相同的比重:

This control is used to define the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200. At 100, the VM assumes equal IO cost and will thus apply memory pressure to the page cache and swap-backed pages equally; lower values signify more expensive swap IO, higher values indicates cheaper.

另外是設定 0 時的方式,在不夠用的時候還是會去用:

At 0, the kernel will not initiate swap until the amount of free and file-backed pages is less than the high watermark in a zone.

目前看起來之前建議設成 1 的方式應該是還 OK...

NVIDIA 開源 Linux GPU Kernel Driver

NVIDIA 宣佈開源 Linux 下的 GPU Kernel Driver:「NVIDIA Releases Open-Source GPU Kernel Modules」。

從一些描述上可以看出來,應該是因為 Datacenter 端的動力推動的,所以這次 open source 的版本中,對 Datacenter GPU 的支援是 production level,但對 GeForce GPU 與 Workstation GPU 的支援直接掛 alpha level:

Which GPUs are supported by Open GPU Kernel Modules?

Open kernel modules support all Ampere and Turing GPUs. Datacenter GPUs are supported for production, and support for GeForce and Workstation GPUs is alpha quality. Please refer to the Datacenter, NVIDIA RTX, and GeForce product tables for more details (Turing and above have compute capability of 7.5 or greater).

然後 user-mode driver 還是 closed source:

Will the source for user-mode drivers such as CUDA be published?

These changes are for the kernel modules; while the user-mode components are untouched. So the user-mode will remain closed source and published with pre-built binaries in the driver and the CUDA toolkit.

nouveau 來說,是可以從 open source driver 裡面挖一些東西出來用,不過能挖到跟 proprietary 同樣效能水準嗎?

Linux 打算合併 /dev/random 與 /dev/urandom 遇到的問題

Hacker News 上看到「Problems emerge for a unified /dev/*random (lwn.net)」的,原文是「Problems emerge for a unified /dev/*random」(付費內容,但是可以透過 Hacker News 上的連結直接看)。

標題提到的兩個 device 的性質會需要一些背景知識,可以參考維基百科上面「/dev/random」這篇的說明,兩個都是 CSPRNG,主要的分別在於 /dev/urandom 通常不會 block:

The /dev/urandom device typically was never a blocking device, even if the pseudorandom number generator seed was not fully initialized with entropy since boot.

/dev/random 不保證不會 block,有可能會因為 entropy 不夠而卡住:

/dev/random typically blocked if there was less entropy available than requested; more recently (see below, different OS's differ) it usually blocks at startup until sufficient entropy has been gathered, then unblocks permanently.

然後順便講一下,因為這是 crypto 相關的設計修改,加上是 kernel level 的界面,安全性以及相容性都會是很在意的點,而 Hacker News 上的討論裡面很多是不太在意這些的,你會看到很多「很有趣」的想法在上面討論 XDDD

回到原來的文章,Jason A. Donenfeld (Linux kernel 裡 RNG maintainer 之一,不過近期比較知名的事情還是 WireGuard 的發明人) 最近不斷的在改善 Linux kernel 裡面這塊架構,這次打算直接拿 /dev/random 換掉 /dev/urandom:「Uniting the Linux random-number devices」。

不過換完後 Google 的 Guenter Roeck 就在抱怨在 QEMU 環境裡面炸掉了:

This patch (or a later version of it) made it into mainline and causes a large number of qemu boot test failures for various architectures (arm, m68k, microblaze, sparc32, xtensa are the ones I observed). Common denominator is that boot hangs at "Saving random seed:". A sample bisect log is attached. Reverting this patch fixes the problem.

他透過 git bisect 找到發生問題的 commit,另外從卡住的訊息也可以大概猜到在虛擬機下 entropy 不太夠。

另外從他們三個 (加上 Linus) 在 mailing list 上面討論的訊息可以看到不少交流:「Re: [PATCH v1] random: block in /dev/urandom」,包括嘗試「餵」entropy 進 /dev/urandom 的 code...

後續看起來還會有一些嘗試,但短期內看起來應該還是會先分開...

最近 Linux 核心安全性問題的 Dirty Pipe 故事很有趣...

Hacker News 上看到「The Dirty Pipe Vulnerability」這個 Linux kernel 的安全性問題,Hacker News 上相關的討論在「The Dirty Pipe Vulnerability (cm4all.com)」這邊可以看到。

這次出包的是 splice() 的問題,先講他寫出可重製 bug 的程式碼,首先是第一個程式用 user1 放著跑:

#include <unistd.h>
int main(int argc, char **argv) {
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

然後第二個程式也放著跑 (可以是不同的 user2,完全無法碰到 user1 的權限):

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
  for (;;) {
    splice(0, 0, 1, 0, 2, 0);
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null

理論上不會在 foo 裡面看到任何 BBBBB 的字串,但卻打穿了... 透過 git bisect 的檢查,他也確認了是在「pipe: merge anon_pipe_buf*_ops」這個 commit 時出的問題。

不過找到問題的過程拉的頗長,一開始是有 web hosting 服務的 support ticket 說 access log 下載下來發現爛掉了,無法解壓縮:

It all started a year ago with a support ticket about corrupt files. A customer complained that the access logs they downloaded could not be decompressed. And indeed, there was a corrupt log file on one of the log servers; it could be decompressed, but gzip reported a CRC error.

然後他先手動處理就把票關起來了:

I fixed the file’s CRC manually, closed the ticket, and soon forgot about the problem.

接下來過幾個月後又發生,經過幾次的 support ticket 後他手上就有一些「資料」可以看:

Months later, this happened again and yet again. Every time, the file’s contents looked correct, only the CRC at the end of the file was wrong. Now, with several corrupt files, I was able to dig deeper and found a surprising kind of corruption. A pattern emerged.

然後因為發生的頻率也不是很高,加上邏輯上卡到死胡同,所以他也沒有辦法花太多時間在上面:

None of this made sense, but new support tickets kept coming in (at a very slow rate). There was some systematic problem, but I just couldn’t get a grip on it. That gave me a lot of frustration, but I was busy with other tasks, and I kept pushing this file corruption problem to the back of my queue.

後來真的花時間下去找,利用先前的 pattern 掃了一次系統 log,發現有規律在:

External pressure brought this problem back into my consciousness. I scanned the whole hard disk for corrupt files (which took two days), hoping for more patterns to emerge. And indeed, there was a pattern:

  • there were 37 corrupt files within the past 3 months
  • they occurred on 22 unique days
  • 18 of those days have 1 corruption
  • 1 day has 2 corruptions (2021-11-21)
  • 1 day has 7 corruptions (2021-11-30)
  • 1 day has 6 corruptions (2021-12-31)
  • 1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one which most corruptions occur.

然後就試著寫各種 reproducible code,最後成功的版本就是開頭提到的,然後他發現這個漏洞可以是 security vulnerability,就回報出去了,可以看到前後從第一次的 support ticket 到最後解決花了快一年的時間,不過 Linux kernel 端修正的速度蠻快的:

  • 2021-04-29: first support ticket about file corruption
  • 2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability
  • 2022-02-20: bug report, exploit and patch sent to the Linux kernel security team
  • 2022-02-21: bug reproduced on Google Pixel 6; bug report sent to the Android Security Team
  • 2022-02-21: patch sent to LKML (without vulnerability details) as suggested by Linus Torvalds, Willy Tarreau and Al Viro
  • 2022-02-23: Linux stable releases with my bug fix (5.16.11, 5.15.25, 5.10.102)
  • 2022-02-24: Google merges my bug fix into the Android kernel
  • 2022-02-28: notified the linux-distros mailing list
  • 2022-03-07: public disclosure

整個故事還蠻精彩的 XD