The story behind Dirty Pipe, the recent Linux kernel security issue, is fascinating...

I came across "The Dirty Pipe Vulnerability", a Linux kernel security issue, on Hacker News; the related discussion is at "The Dirty Pipe Vulnerability (cm4all.com)".

The bug this time is in splice(). Let me start with the code the author wrote to reproduce it. The first program is left running as user1:

#include <unistd.h>
int main(int argc, char **argv) {
  /* Endlessly write "AAAAA" to stdout (redirected to the file foo). */
  for (;;) write(1, "AAAAA", 5);
}
// ./writer >foo

Then the second program is also left running (it can be a different user, user2, with no access at all to anything user1 owns):

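#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
  for (;;) {
    /* Move 2 bytes from stdin (the file foo) into the stdout pipe. */
    splice(0, 0, 1, 0, 2, 0);
    /* Then write ordinary data into the same pipe. */
    write(1, "BBBBB", 5);
  }
}
// ./splicer <foo |cat >/dev/null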

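In theory the string BBBBB should never show up in foo, since the second program only reads from it, yet it does... Using git bisect, he also confirmed that the problem first appears with the commit "pipe: merge anon_pipe_buf*_ops".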
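To see whether the reproducer has hit the bug, one can scan foo for anything other than the byte A. Below is a minimal Python sketch, assuming the output file is named foo as in the command above; the helper names find_foreign_bytes and check_file are hypothetical and not part of the original write-up.

```python
def find_foreign_bytes(data: bytes, expected: bytes = b"A") -> list[int]:
    """Return offsets of bytes in `data` that differ from the single byte `expected`."""
    want = expected[0]
    return [offset for offset, value in enumerate(data) if value != want]


def check_file(path: str = "foo", chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` contains nothing but the byte 'A'."""
    with open(path, "rb") as f:
        # Read in chunks so that a large, still-growing file does not need to fit in memory.
        while chunk := f.read(chunk_size):
            if find_foreign_bytes(chunk):
                return False
    return True
```

On an unaffected kernel check_file("foo") should keep returning True no matter how long the two programs run.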

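Tracking the problem down took quite a while, though. It began with a support ticket at a web hosting service: a customer had downloaded access logs and found them corrupt and impossible to decompress: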

It all started a year ago with a support ticket about corrupt files. A customer complained that the access logs they downloaded could not be decompressed. And indeed, there was a corrupt log file on one of the log servers; it could be decompressed, but gzip reported a CRC error.


I fixed the file’s CRC manually, closed the ticket, and soon forgot about the problem.

A few months later it happened again, and after several more support tickets he had some "data" to look at:

Months later, this happened again and yet again. Every time, the file’s contents looked correct, only the CRC at the end of the file was wrong. Now, with several corrupt files, I was able to dig deeper and found a surprising kind of corruption. A pattern emerged.


None of this made sense, but new support tickets kept coming in (at a very slow rate). There was some systematic problem, but I just couldn’t get a grip on it. That gave me a lot of frustration, but I was busy with other tasks, and I kept pushing this file corruption problem to the back of my queue.

Later he finally put real time into it, scanned the log files on disk for the corruption pattern seen earlier, and found a regularity:

External pressure brought this problem back into my consciousness. I scanned the whole hard disk for corrupt files (which took two days), hoping for more patterns to emerge. And indeed, there was a pattern:

  • there were 37 corrupt files within the past 3 months
  • they occurred on 22 unique days
  • 18 of those days have 1 corruption
  • 1 day has 2 corruptions (2021-11-21)
  • 1 day has 7 corruptions (2021-11-30)
  • 1 day has 6 corruptions (2021-12-31)
  • 1 day has 4 corruptions (2022-01-31)

The last day of each month is clearly the one which most corruptions occur.
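A scan like the one described can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the logs are *.gz files under some directory; the function names are hypothetical and the author's actual tooling is not shown in the source.

```python
import gzip
import os
import zlib


def gzip_is_corrupt(path: str) -> bool:
    """Return True if reading the gzip file to the end fails (for example on a CRC mismatch)."""
    try:
        with gzip.open(path, "rb") as f:
            # The CRC in the gzip trailer is only verified once the stream has been read fully.
            while f.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile (raised on CRC errors) is a subclass of OSError.
        # Note: this sketch also counts plain I/O errors as "corrupt".
        return True
    return False


def find_corrupt_logs(root: str) -> list[str]:
    """Walk `root` and return the sorted paths of all *.gz files that fail to decompress cleanly."""
    corrupt = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".gz"):
                path = os.path.join(dirpath, name)
                if gzip_is_corrupt(path):
                    corrupt.append(path)
    return sorted(corrupt)
```

Grouping the resulting paths by modification date would then give a per-day tally like the one quoted above.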

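He then tried writing all sorts of reproducers; the version that finally worked is the one shown at the top. At that point he realized the bug was a security vulnerability and reported it. From the first support ticket to the final fix took nearly a year, but the Linux kernel side moved quickly, with stable releases carrying the fix four days after the bug was identified: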

  • 2021-04-29: first support ticket about file corruption
  • 2022-02-19: file corruption problem identified as Linux kernel bug, which turned out to be an exploitable vulnerability
  • 2022-02-20: bug report, exploit and patch sent to the Linux kernel security team
  • 2022-02-21: bug reproduced on Google Pixel 6; bug report sent to the Android Security Team
  • 2022-02-21: patch sent to LKML (without vulnerability details) as suggested by Linus Torvalds, Willy Tarreau and Al Viro
  • 2022-02-23: Linux stable releases with my bug fix (5.16.11, 5.15.25, 5.10.102)
  • 2022-02-24: Google merges my bug fix into the Android kernel
  • 2022-02-28: notified the linux-distros mailing list
  • 2022-03-07: public disclosure

The whole story is quite a ride XD

Ingo Molnár proposes a mega patch to speed up Linux kernel builds

Spotted on the Hacker News front page: "Massive ~2.3k Patch Series Would Improve Linux Build Times 50~80% & Fix "Dependency Hell"". The corresponding mailing list post is "* [PATCH 0000/2297] [ANNOUNCE, RFC] "Fast Kernel Headers" Tree -v1: Eliminate the Linux kernel's "Dependency Hell"", and just look at that "0000/2297" prefix XDDD

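His main goal is to improve Linux kernel compile time (as the project name "Fast Kernel Headers" suggests); he just did not expect it to shrink this much. Along the way it also tackles the dependency hell problem, which improves maintainability.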

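The measured results are striking: from 231.34 +- 0.60 secs (15.5 builds/hour) to 129.97 +- 0.51 secs (27.7 builds/hour), a 78% improvement when counted in builds per hour. In CPU time, it drops from 11,474,982.05 msec cpu-clock to 7,100,730.37 msec cpu-clock, which by the same builds-per-unit-of-time measure is a 61.6% improvement...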
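The percentages are throughput gains (more builds in the same time), not reductions in build time; the wall-clock time itself falls by about 44%. A short Python check of the arithmetic, using only the figures quoted above (the function name is made up for this sketch):

```python
def throughput_gain_percent(before: float, after: float) -> float:
    """Percent increase in work done per unit of time when the cost drops from `before` to `after`."""
    return (before / after - 1.0) * 100.0


# Wall-clock seconds per build, before and after the patch series.
wall = throughput_gain_percent(231.34, 129.97)
# Total cpu-clock milliseconds per build, before and after.
cpu = throughput_gain_percent(11_474_982.05, 7_100_730.37)

print(f"wall-clock: {wall:.1f}% more builds per hour")        # 78.0%
print(f"cpu-clock:  {cpu:.1f}% more builds per unit of CPU time")  # 61.6%
```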



When I started this project, late 2020, I expected there to be maybe 50-100 patches. I did a few crude measurements that suggested that about 20% build speed improvement could be gained by reducing header dependencies, without having a substantial runtime effect on the kernel. Seemed substantial enough to justify 50-100 commits.


But as the number of patches increased, I saw only limited performance increases. By mid-2021 I got to over 500 commits in this tree and had to throw away my second attempt (!), the first two approaches simply didn't scale, weren't maintainable and barely offered a 4% build speedup, not worth the churn of 500 patches and not worth even announcing.


With the third attempt I introduced the per_task() machinery which brought the necessary flexibility to reduce dependencies drastically, and it was a type-clean approach that improved maintainability. But even at 1,000 commits I barely got to a 10% build speed improvement. Again this was not something I felt comfortable pushing upstream, or even announcing. :-/


But the numbers were pretty clear: 20% performance gains were very much possible. So I kept developing this tree, and most of the speedups started arriving after over 1,500 commits, in the fall of 2021. I was very surprised when it went beyond 20% speedup and more, then arrived at the current 78% with my reference config. There's a clear super-linear improvement property of kernel build overhead, once the number of dependencies is reduced to the bare minimum.

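The patch series is huge, but it appears to improve compile time a great deal, so it should generate plenty of discussion... The news is very fresh (the mail went out at 5 a.m. today, Taiwan time), so it will be worth checking later to see how the other senior kernel developers respond...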

Tuning PostgreSQL on ZFS

"Everything I've seen on optimizing Postgres on ZFS" collects tuning advice for running PostgreSQL on ZFS. The author seems to keep the post updated, so it is worth revisiting whenever the need comes up...

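The main audience is people running self-hosted PostgreSQL. Compared with ext4 or XFS, having ZFS underneath enables a lot, such as compression and snapshots, which makes many DBA operations more convenient. The flip side is that both layers (ZFS and PostgreSQL) have to be tuned together to keep performance up...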

In the short term I will probably stick with RDS, though...

The Linux kernel RNG switches from SHA-1 to BLAKE2s

News seen via Hacker News Daily: the SHA-1 algorithm used inside the Linux kernel's RNG has been replaced with BLAKE2s.

SHA-1's known weaknesses are a latent risk, but picking BLAKE2s in particular is probably the maintainer's preference; Jason Donenfeld also uses BLAKE2s in WireGuard...
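For a feel of the two hash functions, Python's hashlib exposes both. This only illustrates the primitives themselves, not how the kernel RNG uses them; the input bytes are made up for the example.

```python
import hashlib

# Made-up input; the kernel hashes its entropy pool, which is not modelled here.
sample = b"example input bytes"

sha1_digest = hashlib.sha1(sample).digest()
blake2s_digest = hashlib.blake2s(sample).digest()

print(len(sha1_digest) * 8, "bit SHA-1 digest")                     # 160
print(len(blake2s_digest) * 8, "bit BLAKE2s digest (default size)")  # 256
```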

Using Exodus to bundle Linux ELF binaries for other machines

A tool I saw on Hacker News Daily a few days ago: "Exodus". The official description reads:

Painless relocation of Linux binaries–and all of their dependencies–without containers.

Technically, besides moving the Linux ELF binary to another machine, it also bundles the dynamic libraries the binary depends on:

  • Finding and bundling all of a binary's dependencies.
  • Launching the binary in such a way that the proper dependencies are used without any potential interaction from system libraries on the destination machine.

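Since the Linux kernel tries hard to keep its ABI compatible, this should mostly work without trouble, unless the binary happens to rely on a kernel interface newer than what the destination machine provides...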
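The first step, finding a binary's shared library dependencies, can be approximated with ldd. The Python sketch below is an illustration only and makes no claim about how Exodus itself discovers dependencies; the function names are hypothetical.

```python
import subprocess


def parse_ldd_output(text: str) -> list[str]:
    """Extract absolute library paths from `ldd` output; unresolved entries are skipped."""
    paths = []
    for line in text.splitlines():
        line = line.strip()
        if "=>" in line:
            # Form: "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)"
            target = line.split("=>", 1)[1].strip()
        else:
            # Form: "/lib64/ld-linux-x86-64.so.2 (0x...)" or "linux-vdso.so.1 (0x...)"
            target = line
        path = target.split(" (", 1)[0].strip()
        if path.startswith("/"):
            paths.append(path)
    return paths


def list_dependencies(binary: str) -> list[str]:
    """Run `ldd` on a binary and return the shared libraries it resolves to.

    Only use this on trusted binaries: ldd may execute code from the file it inspects.
    """
    result = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    return parse_ldd_output(result.stdout)
```

Copying the returned files next to the binary is the easy half; launching it so that only the bundled libraries are used is the part Exodus automates.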

It looks like an alternative to static compilation; it might be worth trying on FFmpeg.

An indie game developer on the benefits of shipping a Linux version


Found on the Hacker News front page; judging by the upvote count, it should make it into today's Hacker News Daily: "Despite having just 5.8% sales, over 38% of bug reports come from Linux (reddit.com)".

The author is an indie game developer who released "ΔV: Rings of Saturn" two years ago and also published a Linux version.

The author starts with the numbers: he sold a little over 12,000 copies, 700 of them to Linux players; he received 1,040 bug reports, roughly 400 of them from Linux players:

As of today, I sold a little over 12,000 units of ΔV in total. 700 of these units were bought by Linux players. That’s 5.8%. I got 1040 bug reports in total, out of which roughly 400 are made by Linux players.

That’s one report per 11.5 users on average, and one report per 1.75 Linux players. That’s right, an average Linux player will get you 650% more bug reports.
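The quoted figures can be checked with a few lines of Python, using only the numbers from the post. By this calculation a Linux player files roughly 6.6 times as many reports as the average player, which is the ratio behind the author's "650%" figure.

```python
total_units, linux_units = 12_000, 700
total_reports, linux_reports = 1_040, 400

linux_sales_share = linux_units / total_units * 100        # percent of sales
linux_report_share = linux_reports / total_reports * 100   # percent of bug reports

players_per_report_all = total_units / total_reports       # all platforms together
players_per_report_linux = linux_units / linux_reports     # Linux only
ratio = players_per_report_all / players_per_report_linux

print(f"Linux share of sales:   {linux_sales_share:.1f}%")
print(f"Linux share of reports: {linux_report_share:.1f}%")
print(f"players per report: {players_per_report_all:.1f} overall, "
      f"{players_per_report_linux:.2f} on Linux ({ratio:.1f}x)")
```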

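Reading this, you might think "Linux players are hard to please", but the author points out that only 3 of those 400 bugs were platform-specific (occurring only on Linux); the rest affected every platform: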

A lot of extra work for just 5.8% of extra units, right?

Wrong. Bugs exist whenever you know about them, or not.

Do you know how many of these 400 bug reports were actually platform-specific? 3. Literally only 3 things were problems that came out just on Linux. The rest of them were affecting everyone[.]

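The reason is a habit the Linux community has built up from taking part in open source projects: reporters spell out the details as clearly as they can and work out a way to reproduce the problem: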

The thing is, the Linux community is exceptionally well trained in reporting bugs. That is just the open-source way. This 5.8% of players found 38% of all the bugs that affected everyone. Just like having your own 700-person strong QA team. That was not 38% extra work for me, that was just free QA!

But that’s not all. The report quality is stellar.

This is nothing like the typical player's report. Linux players routinely attach basic environment information and system logs, and sometimes even core dumps and reproduction steps:

I mean we have all seen bug reports like: “it crashes for me after a few hours”. Do you know what a developer can do with such a report? Feel sorry at best. You can’t really fix any bug unless you can replicate it, see it with your own eyes, peek inside and finally see that it’s fixed.

And with bug reports from Linux players is just something else. You get all the software/os versions, all the logs, you get core dumps and you get replication steps. Sometimes I got with the player over discord and we quickly iterated a few versions with progressive fixes to isolate the problem. You just don’t get that kind of engagement from anyone else.

I wonder whether anyone has ever sent in GDB output...

A very unusual write-up XDDD

Mouse wheel scroll speed on Ubuntu

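Lately I have been switching back to Windows a lot for D2R, and noticed that the scroll wheel is much faster there. Back on Ubuntu 20.04 I found there is no setting for mouse wheel speed, so I tried a few solutions, and it turns out there are still plenty of pitfalls XD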

Searching turns up "Increase mouse wheel scroll speed" and "How to change my mouse wheel scroll rate?". The top-voted answer in both is imwheel, but its latest release dates from 2004, and in practice it has many bugs on a modern system...

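The alternative I went with is "Mouse scroll wheel acceleration, implemented in user space", where the author controls the acceleration from Python; in my testing it behaves much more sensibly. In the example command ./main.py -v --exp 1, the --exp 1 setting turned out a bit too fast for me, so I changed it to 0.75.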

First install the dependencies the author lists, then add the script to Session and Startup so that it runs after login.

Ubuntu 14.04 and 16.04 support with ESM extended from eight years to ten

Originally, ESM for the older Ubuntu releases was three extra years (on top of the five years of LTS, eight in total): 14.04 would be supported until April 2022 (see the Internet Archive copy of "Ubuntu 14.04 LTS has transitioned to ESM support") and 16.04 until April 2024 (see the Internet Archive copy of "Ubuntu 16.04 LTS transitions to Extended Security Maintenance (ESM)"), whereas for 18.04, 20.04 and later, ESM is five extra years.

Now 14.04 and 16.04 have been aligned to five extra years as well, so every release gets ten years in total: "Ubuntu 14.04 and 16.04 lifecycle extended to ten years".
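The date arithmetic is simple enough to write down. This is a small Python sketch that assumes each release ships in April of the year in its version number (14.04 in April 2014, 16.04 in April 2016); the function name is made up.

```python
LTS_YEARS = 5  # standard LTS support period


def end_of_support_year(release_year: int, esm_years: int) -> int:
    """Year in which support ends, given the release year and the length of ESM."""
    return release_year + LTS_YEARS + esm_years


for release, year in (("14.04", 2014), ("16.04", 2016)):
    old = end_of_support_year(year, esm_years=3)
    new = end_of_support_year(year, esm_years=5)
    print(f"Ubuntu {release}: April {old} -> April {new}")
```

So 14.04 moves from April 2022 to April 2024, and 16.04 from April 2024 to April 2026.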

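The Hacker News discussion is also worth a look: "Ubuntu 14.04 and 16.04 lifecycle extended to ten years (ubuntu.com)". Some people think this policy is terrible, but I think it is fine; some commercial environments simply prefer to pay rather than upgrade... (the plan with security updates only and no support costs just USD$225/year per physical machine, and even less for a virtual machine)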

It is a fairly important option for dealing with legacy applications...

Using Ephemeral Storage to speed up MySQL over ZFS

Percona's "MySQL/ZFS in the Cloud, Leveraging Ephemeral Storage" explores how ZFS performs when it can use Ephemeral Storage (the local disks attached to the instance).

The first test uses the local disks directly as the main storage. With ZFS in both cases, local storage is still considerably faster than Premium SSD (an Azure product line).

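Keeping the data itself on local storage is a bit too exciting, though, since ephemeral disks do not outlive the instance, and it is unlikely anyone would do this in production. The later tests therefore use the local storage as L2ARC instead, and the improvement is substantial, even approaching the numbers from putting the data directly on local storage.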

ext4 with bcache was tested as well, and its performance looks less impressive.


Lots of commands starting with ls...

I came across the ls* commands on Twitter.

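The article is "ls* Commands Are Even More Useful Than You May Have Thought". I assumed it would be about extra software to install, but everything it covers turns out to be already installed by default...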


  • lsblk (storage devices)
  • lshw (hardware)
  • lscpu (CPU)
  • lspci (PCI)
  • lsusb (USB)
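Whether these commands really are preinstalled depends on the distribution. A quick way to check a given machine, sketched in Python with shutil.which (the helper name is made up):

```python
import shutil
from typing import Optional

LS_COMMANDS = ["lsblk", "lshw", "lscpu", "lspci", "lsusb"]


def available_ls_commands(names: list[str] = LS_COMMANDS) -> dict[str, Optional[str]]:
    """Map each command name to its full path on PATH, or None if it is not installed."""
    return {name: shutil.which(name) for name in names}


for name, path in available_ls_commands().items():
    print(f"{name:6} {path or 'not found'}")
```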

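Trying each one once to get a feel for it is enough; these days, when a problem actually comes up, I still rely on a search engine for the answer XDDD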