AMD Zen 3 與 Zen 4 上 FSRM (Fast Short REP MOV) 的效能問題

前幾天 Hacker News 上討論到的一篇:「Rust std fs slower than Python? No, it's hardware (xuanwo.io)」,原文則是在「Rust std fs slower than Python!? No, it's hardware!」。

原因是作者收到回報,提到一段 Rust 寫的 code (在文章裡面的 read_file_with_opendal(),透過 OpenDAL 去讀) 比 Python 的 code 還慢 (在文章裡面的 read_file_with_normal(),直接用 Python 的 open() 開然後讀取)。

先講最後發現問題是 Zen 3 (桌機版 5 系列的 CPU) 與 Zen 4 (桌機版 7 系列的 CPU) 這兩個架構上 REP MOV 系列的指令在某些情境下 (與 offset 有關) 有效能上的問題。

FSRM 類的指令被用在 memcpy()memmove() 類的地方,算是很常見備用到的功能,這次追蹤的問題發現在 glibc 裡面用到導致效能異常。

另外也可以查到在 Linux kernel 裡面也有用到:「Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()」,所以後續應該也會有些改善的討論...

Ubuntu 這邊的 issue ticket 開在「Terrible memcpy performance on Zen 3 when using rep movsb」這,上游的 glibc 也有對應的追蹤:「30995 – Zen 4: sub-optimal memcpy on very large copies」。

從作者私下得知的消息,因為 patch space 的大小限制,AMD 可能無法提供 CPU microcode 上的 patch,直接解決問題:

However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me.

所以目前比較可行的作法是在 glibc 裡面使用到 FSRM 的地方針對 Zen 3 與 Zen 4 放 workaround,回到原來沒有 FSRM 的方式處理:

Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates.

另外在追蹤問題的過程遇到不同的情境,得拿出不同的 profiling 工具出來用,所以也還蠻值得看過一次有個印象:

一開始的 timeit 算是 Python 裡面簡單的 benchmark library:

接著的比較是用 command line 的工具 hyperfine 產生出來的 (給兩個 command 讓他跑),查了一下發現在 Ubuntu 官方的 apt repository 裡面有包進去 (22.04+):

再來是用 strace 追問題,這個算是經典工具了,可以拿來看 syscall 被呼叫的時間點:

到後面出現了 perf 可以拿來看更底層的資訊,像是 CPU 內 cache 的情況:

接續提到的「hotspot ASM」應該也還是 perf 輸出的格式,不過不是那麼確定... 在「perf Examples」這邊可以看到 function 的分析:

而文章裡的則是可以看到已經到 assembly 層級了:

差不多就這些...

Firefox 的 :has() 支援度問題

Can I Use... 上面的「:has() CSS relational pseudo-class」裡面可以看到 Firefox 一直都還沒支援,只有在 nightly 版本上面預設有開,這也代表還沒辦法很穩定的用這個 selector... (除非你直接忽略掉 Firefox)。

看到「Prodding Firefox to Update :has() Selection」這篇,Eric Meyer 在抱怨 Firefox 的這個問題,就算是 nightly 的版本也還是有奇怪的 bug,會抓不到 :has() 條件,需要用 contenteditable 的 workaround 觸發 Firefox 的計算。

MozillaBugzilla 上可以看到票還開著:「Implement the :has() pseudo class」,看起來這陣子是一直有在更新,應該是有資源在上面追進度。

等到真的支援,就可以在 querySelectorAll() 裡面直接用 :has() 了,現在 Firefox 的情況讓 Userscript 寫起來有點卡...

把 RabbitMQ 換成 PostgreSQL 的那篇文章...

Hacker News 上看到「SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)」這篇文章,原文在「SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue」這邊,裡面在講他們把 RabbitMQ 換成 PostgreSQL 的前因後果。

文章裡面可以吐嘈的點其實蠻多的,而且在 Hacker News 上也有被點出來,像是有人就有提到他們遇到了 bug (或是 feature) 卻不解決 bug,而是決定直接改寫成用 PostgreSQL 來解決,其實很怪:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

另外一個吐嘈的點是量的部份,如果就這樣的量,用 PostgreSQL 降低使用的 tech stack 應該是個不錯的決定 (但另外一個問題就是,當初為什麼要導入 RabbitMQ...):

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

在同一個 thread 下面也有人提到這個量真的很小,甚至直接不講武德提到可以用 Jenkins 解 XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

然後意外看到 Simon Willison 提到了一個重點,就是 RabbitMQ 到現在還是不支援 ACID 等級的 job queuing (尤其是 Durability 的部份),也就是希望 MQ 系統回報成功收到的 task 一定會被處理:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

這也是當年為什麼用 MySQL 幹類似的事情,要 ACID 的特性來確保內容不會掉。

這也是目前我覺得唯一還需要用 RDBMS 當 queue backend 的地方,但原文公司的想法就很迷,遇到 library bug 後決定換架構,而不是想辦法解 bug,還很開心的寫一篇文章來宣傳...

uBlock Origin 1.48.0 的改善

Hacker News 上看到「uBlock Origin 1.48 adds readiness status, code viewer, and other fixes (github.com/gorhill)」這則消息,uBlock Origin 在 1.48 有個蠻重要的 UI/UX 改善 (Readiness status at browser launch)。

uBlock Origin 預設會搭配「工人智慧」維護的列表,這些列表通常都不小,在剛開瀏覽器,還在讀取的過程中去看網站會遇到阻擋不完整的情況。

先前沒有辦法知道這個問題,在這版加上了對應的 icon color 來解決,黃色表示還在讀:

這時候跑去逛網站的話會出現驚嘆號:

讀取完後 icon 會變成標準的紅色,但驚嘆號仍然會留著,表示這個頁面未必有完整過濾:

正常有阻擋的則是這樣:

理論上可以減少 bug report XDDD

To reduce the number of reports caused by this issue which is outside of uBO's control, uBO's toolbar icon will now reflect its readiness status at browser launch.

Mac 會自己改變 Desktop 位置的問題

以前好像沒遇過,換了 M1 以後才注意到 desktop 位置位自己被改變,覺得很阿雜... 找了資料才發現是個 "feature":「How to prevent Mac from changing the order of Desktops/Spaces」。

關掉就好了,網路上的資料最早出現在 2018 年左右,大概是那個時候被加進去的?

GNU Make 在 4.4 引入的 --shuffle

Hacker News 首頁上看到的,作者送了一個提案到 GNU Make,後來被採用,在 4.4 版引入了 --shuffle 指令:「New make --shuffle mode」。

這個功能主要是想要找出在 Makefile 裡面沒有被定義好,平常是因為 side effect 而沒有出錯的地方。

像是作者就發現 libgfortran 沒有把 libquadmath 放到 dependency 的問題:

For example gcc’s libgfortran is missing a libquadmath build dependency. It is natural not to encounter it in real world as libquadmath is usually built along with other small runtimes way before g++ or gfortran is ready.

他的基本想法是把 target 的順序打亂掉,也就是在有指定 --shuffle 時,不一定會照 a -> b -> c 的順序往下遞迴,而可能會是 c -> b -> a 或是其他的順序:

all: a b c

這樣對於抓那些在 -j 平行編譯時會出包的套件也很有幫助,不需要在 -j 開很大的情況下才能重製問題,而是平常就有機會在 CI 環境下被抓出來。

CSS 的 feature detection:@support

在「Conditional CSS」這篇裡面在講很多 CSS 條件過濾的方式,裡面看到有 @support 這個規格,可以透過 feature detection 的方式來過濾:「CSS at-rule: @supports: selector()」。

文章作者給的範例是這樣:

@supports selector(:has(p)) {
  .card-thumb {
    aspect-ratio: 1;
  }
}

在瀏覽器支援 :has(p) 的情況下才指定裡面的 CSS。

翻了一下 @support 在各家瀏覽器上實做的情況:在 Firefox 上是 69 開始支援,推出的日期是 2019/09/03。在 Chrome 上是 83 開始支援,推出的日期是 2020/05/19。在 Safari 上是 14.1 開始支援 (對應到 iOS 版本是 14.5),推出的日期是 2021/04/26。

從日期可以看出來算是比較新的功能,但主要幾個大的瀏覽器都支援了。

這個讓我想起來早期利用各家瀏覽器的 bug 產生出的各種 hack:「Browser Specific Hacks」。

解決 Ubuntu 重開機後麥克風聲音太小的問題

Ubuntu 桌機重開機後會遇到外接 USB 麥克風 AM310 的聲音會變得太小的問題,常常是開會的時候被同事提醒才去調。

查了一下是不是有 bug,看起來跟「Mic input volume always resets (to middle/low value) on resume or restart」這個 bug 有關,但這個回報是 16.04,到現在 22.04 都出了,好像沒有新進度...

接著就是找看看有沒有 workaround 可以用,其中一種想法是找出用 command line 設定音量的方式,這樣就可以在開機的時候自動執行。

接著就找到「How can i increase microphone volume beyond 100%」這個問答,首先用這個指定列出所有的 source:

pactl list sources

裡面可以看到 AM310 的資料,接著就可以透過 name 的部份指定音量了:

pactl set-source-volume alsa_input.usb-AVerMedia_AVerMedia_AM310_USB_Microphone-00.multichannel-input 70%

放到 startup script 在 login 的時候跑就 OK 了。

用 dig 查瑞士的 top domain 剛好會遇到的 "feature"

Hacker News 上看到「DNS Esoterica - Why you can't dig Switzerland」這篇,裡面提到 dig 的 "feature"。

拿來查 tw 的 NS 會這樣下:

$ dig tw ns

結果會是列出所有的 NS server:

;; ANSWER SECTION:
tw.                     3600    IN      NS      h.dns.tw.
tw.                     3600    IN      NS      a.dns.tw.
tw.                     3600    IN      NS      g.dns.tw.
tw.                     3600    IN      NS      d.dns.tw.
tw.                     3600    IN      NS      anytld.apnic.net.
tw.                     3600    IN      NS      f.dns.tw.
tw.                     3600    IN      NS      b.dns.tw.
tw.                     3600    IN      NS      e.dns.tw.
tw.                     3600    IN      NS      c.dns.tw.
tw.                     3600    IN      NS      ns.twnic.net.

照著作者說的,ukdig uk ns 可以得到類似的結果:

;; ANSWER SECTION:
uk.                     86400   IN      NS      dns1.nic.uk.
uk.                     86400   IN      NS      dns4.nic.uk.
uk.                     86400   IN      NS      nsa.nic.uk.
uk.                     86400   IN      NS      nsb.nic.uk.
uk.                     86400   IN      NS      nsc.nic.uk.
uk.                     86400   IN      NS      nsd.nic.uk.
uk.                     86400   IN      NS      dns3.nic.uk.
uk.                     86400   IN      NS      dns2.nic.uk.

但如果你下 dig ch ns 就會出現錯誤,像是這樣:

; <<>> DiG 9.16.1-Ubuntu <<>> ch ns
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 5019
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;.                              CH      NS

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jul 15 06:54:24 CST 2022
;; MSG SIZE  rcvd: 28

原因是因為 CH 這個關鍵字是 Chaosnet 的縮寫,而被特殊解讀:

Set the query class. The default class is IN; other classes are HS for Hesiod records or CH for Chaosnet records.

要避開這個解讀需要加上一個 dot (.),採用 FQDN 的方式列出:

dig ch. ns

就會得到正確的結果:

;; ANSWER SECTION:
ch.                     86400   IN      NS      a.nic.ch.
ch.                     86400   IN      NS      b.nic.ch.
ch.                     86400   IN      NS      f.nic.ch.
ch.                     86400   IN      NS      d.nic.ch.
ch.                     86400   IN      NS      e.nic.ch.

另外的方式是 dig -c IN -t NS ch,透過參數的方式讓 dig 不會誤會。

Hacker News 前幾天炸很久的 root cause

前幾天 Hacker News 炸了很久,如果是從 Twitter 上的資料來看,是從 2022/07/08 14:08 UTC 這篇:

中間還原失敗 (2022/07/08 17:35 UTC):

到最後恢復 (2022/07/08 20:48 UTC):

Twitter 這邊的資料看起來差不多是六個小時多,以一個應該是只有 database 需要還原的站台來說的確是蠻久的,所以後續在「HN is up again」這邊就有在討論原因,裡面 HN 的老大 dang 也有提到 downtime 是七個小時多:

8 hours of downtime, but not data loss, since there was no data to lose during the downtime.

Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565

First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)

So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.

There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.

I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.

另外大家就在找 dang 的回應是什麼 (畢竟是第一手資料),用 Ctrl-F 找一下就看到有趣的猜測,從 32028511 這個節點可以看到這串有趣的討論,首先是 mikeiem

You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5years. I am not even kidding.

然後是 kabdib 的回應:

Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).

接著就是 dang 說他覺得這個猜測很有可能:

Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

後續他也從自家的 /newest 裡面撈了相關的資料出來,依照他撈出來的關鍵字,看起來是用 HPE 出的 SSD:

It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)

這次 downtime 看起來很像是中了 SSD firmware bug,目前看起來先搬到 EC2 上面了:

$ host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
$ host 50.112.136.166      
166.136.112.50.in-addr.arpa domain name pointer ec2-50-112-136-166.us-west-2.compute.amazonaws.com.

看討論串應該是暫時性的?