Microsoft Authenticator 的長年 bug

在「Flaw has Microsoft Authenticator overwriting MFA accounts, locking users out (csoonline.com)」這邊看到的,原文在「Design flaw has Microsoft Authenticator overwriting MFA accounts, locking users out」,在講 Microsoft Authenticator (Android 版iOS 版) 這個支援 TOTP 的 MFA 程式的長年 bug... (對一般人比較好理解的,這是六位數字的動態密碼 app)

會造成無法登入的 bug 是因為透過 QR code scan 加入新的帳號時,會蓋掉既有的帳號資料,所以產生的 QR code 就無法在舊的帳號/網站上面使用了:

That’s because, due to an issue involving which fields it uses, Microsoft Authenticator often overwrites accounts when a user adds a new account via QR scan — the most common method of doing so.

原因是因為 username 相同就會蓋掉,而大多數人在不同的地方都會用同樣的 username (像是我的 gslin):

The core of the problem? Microsoft Authenticator will overwrite an account with the same username. Given the prominent use of email addresses for usernames, most users’ apps share the same username. Google Authenticator and just about every other authenticator app add the name of the issuer — such as a bank or a car company — to avoid this issue. Microsoft only uses the username.

然後 workaround 是不要用 Microsoft Authenticator,或是不要用 QR code scan:

There are multiple workarounds. The easiest is for companies to use any other authentication app. Not using the QR code scan feature — and manually entering the code — will also sidestep the issue, which doesn’t appear to arise when the authenticated accounts belong to Microsoft.

然後這個問題可以找到 2020 年開始有人抱怨,但作者測試看起來 2016 年的版本就已經是這樣了:

CSO Online found complaints of this problem dating back to 2020, but it appears to have been in place since Microsoft Authenticator was released in June 2016. (For historical context, Google was the first Authenticator app, having been launched in 2010.)

然後 Microsoft 確認有這樣的行為,但不認為是 bug 而是 feature (怎麼梗圖突然從腦袋裡冒出來...):

Microsoft confirmed the issue but said it was a feature not a bug, and that it was the fault of users or companies that use the app for authentication.

然後專欄作者找了其他專家測試其他的 app,可以發現只有 Microsoft Authenticator 的處理是 override 然後炸掉:

By the way, I’ve tested this behavior in 14 other authenticator apps so far. None of them exhibit the same collision behavior that Microsoft Authenticator does,” he added. “I gave up at 14 because at that point, it’s obvious Microsoft are the ones who are doing things poorly here.

大概是大家都懶得吵了,反正可以用 Google Authenticator 或是其他 TOTP app...

擾人的 lazy loading

維基百科上面有時候會想要看其他語言的頁面,以 Kalafina 這頁來說,想要點開跨國語系的 menu 點日文版頁面的時候,會出現這樣的情況:

游標移上去要點的時候,因為 lazy loading 跑完;DOM 被插入新的 node,造成本來想要的日文版連結跑到下面,然後按快一點的人就點到其他連結... 目前的解法是直接用 uBlock Origin 擋掉那個 node,讓他不要出現:

wikipedia.org##.cx-uls-relevant-languages-banner

這種類型的 UX flaw 在現代愈來愈多了,而且還是各 framework 都推薦的寫法... (各種 useEffect() 類的功能,頁面的結構先出來,再後續讀到內容時再更新頁面)

看到「Creating Perfect Font Fallbacks in CSS」這邊這段時想到的:

With font-display: swap, the browser will first render your text using the fallback typeface you specified in the font-family property[.]

font swapping 也是個讓人吐血的 UX flaw...

nginx 分家:freenginx

Hacker News 上看到 Maxim Dounin 決定分家到 freenginx 的消息:「Freenginx: Core Nginx developer announces fork (nginx.org)」,原文在 mailing list 上:「announcing freenginx.org」,這邊提到分家的原因:

Unfortunately, some new non-technical management at F5 recently decided that they know better how to run open source projects. In particular, they decided to interfere with security policy nginx uses for years, ignoring both the policy and developers’ position.

在 freenginx 的 mailing list 上有提到更多,在 2024-February/000007.html 這篇:

The most recent "security advisory" was released despite the fact that the particular bug in the experimental HTTP/3 code is expected to be fixed as a normal bug as per the existing security policy, and all the developers, including me, agree on this.

And, while the particular action isn't exactly very bad, the approach in general is quite problematic.

這邊提到的 security advisory 是「[nginx-announce] nginx security advisory (CVE-2024-24989, CVE-2024-24990)」這個,看起來是個沒有 enabled by default 的功能:

Two security issues were identified in nginx HTTP/3 implementation,
which might allow an attacker that uses a specially crafted QUIC session
to cause a worker process crash (CVE-2024-24989, CVE-2024-24990) or
might have potential other impact (CVE-2024-24990).

The issues affect nginx compiled with the ngx_http_v3_module (not
compiled by default) if the "quic" option of the "listen" directive
is used in a configuration file.

The issue affects nginx 1.25.0 - 1.25.3.
The issue is fixed in nginx 1.25.4.

id=39373804 這邊有些目前 nginx 組成的資訊可以讀,目前 nginx 的 core devs 應該就三位 (在 Insights/Contributors 這邊看起來只有兩位,這是因為 GitHub 上面的 mirror 看起來是從 Mercurial 同步過去的,而 Sergey Kandaurov 沒有 GitHub 帳號):

Worth noting that there are only two active "core" devs, Maxim Dounin (the OP) and Roman Arutyunyan. Maxim is the biggest contributor that is still active. Maxim and Roman account for basically 99% of current development.

So this is a pretty impactful fork. It's not like one of 8 core devs or something. This is 50% of the team.

Edit: Just noticed Sergey Kandaurov isn't listed on GitHub "contributors" because he doesn't have a GitHub account (my bad). So it's more like 33% of the team. Previous releases have been tagged by Maxim, but the latest (today's 1.25.4) was tagged by Sergey

現在就是單方面的說法,可以再讓子彈多飛一點時間... 看 F5 要不要回應,以及 F5 的說法 (如果要回應的話)。

VirtualBox 內的 Windows 上傳速度很慢的問題

因為我電腦有兩張網卡,兩條線分別接到自己拉的 HiNet 以及社區網路 (不過出去也是 HiNet,這是另外一回事了)。

我桌機的預設 routing 是走自己拉的 HiNet,但我希望 VM 是走社區網路,所以用 bridge mode 設定到網卡上,用 DHCP 取得分享器給的 private IP。

之前一直都沒注意到,前幾天用 Line 傳照片的時候很慢 (之前就有發生了,一直忘記去追問題),花了點時間追問題的時候發現是 VM 裡面的 Windows 10 上傳很慢,這點可以從 Speedtest 的測試結果看到:

先講最後的結論,在交叉測了很多組合後,我發現遇到的問題是把網卡裡的 Large Send Offload (IPv4) (也就是 LSO) 從 Enabled 改成 Disabled

回到當時抓問題的情況,當時先用筆電與 host 測試都沒看到問題,所以看起來應該是 VM 裡面的狀況,但不確定是什麼情況,畢竟不是斷掉...

由於下載速度正常,只有上傳速度卡住,一開始想到的是跟 MTU 相關的問題,所以找了指令降到 1400 後測試,還是一樣...

後來先把 VM 的網路改成 NAT,再測試上傳速度就正常了...

接著想要換個網路卡類型看看,結果卡在找不到 driver。

本來已經想拿 tcpdump 出來追了,但想說先去看看 Windows 10 網卡設定裡面的設定,結果看到 LSO... 就先關看看 (算是以前在 FreeBSD 以及 Linux 下的經驗?)。

然後一關就正常了,交叉再開關兩次確認這個參數有影響,就肯定這個 workaround 應該是有效了...

另外在自己找完問題後,在「Virtualbox 7.0.12 slow upload speed in any Guest OS」這邊看到了類似的問題以及同樣的 workaround。

LSO 過了十幾年還是...

AMD Zen 3 與 Zen 4 上 FSRM (Fast Short REP MOV) 的效能問題

前幾天 Hacker News 上討論到的一篇:「Rust std fs slower than Python? No, it's hardware (xuanwo.io)」,原文則是在「Rust std fs slower than Python!? No, it's hardware!」。

原因是作者收到回報,提到一段 Rust 寫的 code (在文章裡面的 read_file_with_opendal(),透過 OpenDAL 去讀) 比 Python 的 code 還慢 (在文章裡面的 read_file_with_normal(),直接用 Python 的 open() 開然後讀取)。

先講最後發現問題是 Zen 3 (桌機版 5 系列的 CPU) 與 Zen 4 (桌機版 7 系列的 CPU) 這兩個架構上 REP MOV 系列的指令在某些情境下 (與 offset 有關) 有效能上的問題。

FSRM 類的指令被用在 memcpy()memmove() 類的地方,算是很常見備用到的功能,這次追蹤的問題發現在 glibc 裡面用到導致效能異常。

另外也可以查到在 Linux kernel 裡面也有用到:「Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()」,所以後續應該也會有些改善的討論...

Ubuntu 這邊的 issue ticket 開在「Terrible memcpy performance on Zen 3 when using rep movsb」這,上游的 glibc 也有對應的追蹤:「30995 – Zen 4: sub-optimal memcpy on very large copies」。

從作者私下得知的消息,因為 patch space 的大小限制,AMD 可能無法提供 CPU microcode 上的 patch,直接解決問題:

However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me.

所以目前比較可行的作法是在 glibc 裡面使用到 FSRM 的地方針對 Zen 3 與 Zen 4 放 workaround,回到原來沒有 FSRM 的方式處理:

Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates.

另外在追蹤問題的過程遇到不同的情境,得拿出不同的 profiling 工具出來用,所以也還蠻值得看過一次有個印象:

一開始的 timeit 算是 Python 裡面簡單的 benchmark library:

接著的比較是用 command line 的工具 hyperfine 產生出來的 (給兩個 command 讓他跑),查了一下發現在 Ubuntu 官方的 apt repository 裡面有包進去 (22.04+):

再來是用 strace 追問題,這個算是經典工具了,可以拿來看 syscall 被呼叫的時間點:

到後面出現了 perf 可以拿來看更底層的資訊,像是 CPU 內 cache 的情況:

接續提到的「hotspot ASM」應該也還是 perf 輸出的格式,不過不是那麼確定... 在「perf Examples」這邊可以看到 function 的分析:

而文章裡的則是可以看到已經到 assembly 層級了:

差不多就這些...

Firefox 的 :has() 支援度問題

Can I Use... 上面的「:has() CSS relational pseudo-class」裡面可以看到 Firefox 一直都還沒支援,只有在 nightly 版本上面預設有開,這也代表還沒辦法很穩定的用這個 selector... (除非你直接忽略掉 Firefox)。

看到「Prodding Firefox to Update :has() Selection」這篇,Eric Meyer 在抱怨 Firefox 的這個問題,就算是 nightly 的版本也還是有奇怪的 bug,會抓不到 :has() 條件,需要用 contenteditable 的 workaround 觸發 Firefox 的計算。

MozillaBugzilla 上可以看到票還開著:「Implement the :has() pseudo class」,看起來這陣子是一直有在更新,應該是有資源在上面追進度。

等到真的支援,就可以在 querySelectorAll() 裡面直接用 :has() 了,現在 Firefox 的情況讓 Userscript 寫起來有點卡...

把 RabbitMQ 換成 PostgreSQL 的那篇文章...

Hacker News 上看到「SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)」這篇文章,原文在「SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue」這邊,裡面在講他們把 RabbitMQ 換成 PostgreSQL 的前因後果。

文章裡面可以吐嘈的點其實蠻多的,而且在 Hacker News 上也有被點出來,像是有人就有提到他們遇到了 bug (或是 feature) 卻不解決 bug,而是決定直接改寫成用 PostgreSQL 來解決,其實很怪:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

另外一個吐嘈的點是量的部份,如果就這樣的量,用 PostgreSQL 降低使用的 tech stack 應該是個不錯的決定 (但另外一個問題就是,當初為什麼要導入 RabbitMQ...):

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

在同一個 thread 下面也有人提到這個量真的很小,甚至直接不講武德提到可以用 Jenkins 解 XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

然後意外看到 Simon Willison 提到了一個重點,就是 RabbitMQ 到現在還是不支援 ACID 等級的 job queuing (尤其是 Durability 的部份),也就是希望 MQ 系統回報成功收到的 task 一定會被處理:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

這也是當年為什麼用 MySQL 幹類似的事情,要 ACID 的特性來確保內容不會掉。

這也是目前我覺得唯一還需要用 RDBMS 當 queue backend 的地方,但原文公司的想法就很迷,遇到 library bug 後決定換架構,而不是想辦法解 bug,還很開心的寫一篇文章來宣傳...

uBlock Origin 1.48.0 的改善

Hacker News 上看到「uBlock Origin 1.48 adds readiness status, code viewer, and other fixes (github.com/gorhill)」這則消息,uBlock Origin 在 1.48 有個蠻重要的 UI/UX 改善 (Readiness status at browser launch)。

uBlock Origin 預設會搭配「工人智慧」維護的列表,這些列表通常都不小,在剛開瀏覽器,還在讀取的過程中去看網站會遇到阻擋不完整的情況。

先前沒有辦法知道這個問題,在這版加上了對應的 icon color 來解決,黃色表示還在讀:

這時候跑去逛網站的話會出現驚嘆號:

讀取完後 icon 會變成標準的紅色,但驚嘆號仍然會留著,表示這個頁面未必有完整過濾:

正常有阻擋的則是這樣:

理論上可以減少 bug report XDDD

To reduce the number of reports caused by this issue which is outside of uBO's control, uBO's toolbar icon will now reflect its readiness status at browser launch.

Mac 會自己改變 Desktop 位置的問題

以前好像沒遇過,換了 M1 以後才注意到 desktop 位置位自己被改變,覺得很阿雜... 找了資料才發現是個 "feature":「How to prevent Mac from changing the order of Desktops/Spaces」。

關掉就好了,網路上的資料最早出現在 2018 年左右,大概是那個時候被加進去的?

GNU Make 在 4.4 引入的 --shuffle

Hacker News 首頁上看到的,作者送了一個提案到 GNU Make,後來被採用,在 4.4 版引入了 --shuffle 指令:「New make --shuffle mode」。

這個功能主要是想要找出在 Makefile 裡面沒有被定義好,平常是因為 side effect 而沒有出錯的地方。

像是作者就發現 libgfortran 沒有把 libquadmath 放到 dependency 的問題:

For example gcc’s libgfortran is missing a libquadmath build dependency. It is natural not to encounter it in real world as libquadmath is usually built along with other small runtimes way before g++ or gfortran is ready.

他的基本想法是把 target 的順序打亂掉,也就是在有指定 --shuffle 時,不一定會照 a -> b -> c 的順序往下遞迴,而可能會是 c -> b -> a 或是其他的順序:

all: a b c

這樣對於抓那些在 -j 平行編譯時會出包的套件也很有幫助,不需要在 -j 開很大的情況下才能重製問題,而是平常就有機會在 CI 環境下被抓出來。