News About Microsoft Licensing exFAT into the Linux Kernel...

One of the hotter pieces of news lately: Microsoft has officially decided to let the Linux Kernel implement exFAT: "exFAT in the Linux kernel? Yes!". The published specification is at "exFAT file system specification".

Patches have been around for a long time, so technically this was never a big problem; one of the real reasons it never got into the kernel was patents. Even now, Microsoft's license isn't open to everyone who uses Linux, but is aimed primarily at OIN members:

We also support the eventual inclusion of a Linux kernel with exFAT support in a future revision of the Open Invention Network’s Linux System Definition, where, once accepted, the code will benefit from the defensive patent commitments of OIN’s 3040+ members and licensees.

It remains to be seen whether the Linux side will call this off; it doesn't feel like goodwill so much as a PR-flavored attack...

TCP Implementation Security Issues Found by Netflix...

Linux machines have been receiving kernel updates these past few days, triggered by Netflix discovering, and fixing together with the community, a series of security issues in the MSS and SACK handling of the TCP implementations on Linux and FreeBSD: "https://github.com/Netflix/security-bulletins/blob/master/advisories/third-party/2019-001.md".

The most severe is probably CVE-2019-11477, which can cause a Linux kernel panic and affects every kernel version starting from 2.6.29. Machines that can be upgraded can simply take the fix; machines that can't can refer to the two proposed workarounds:

Workaround #1: Block connections with a low MSS using one of the supplied filters. (The values in the filters are examples. You can apply a higher or lower limit, as appropriate for your environment.) Note that these filters may break legitimate connections which rely on a low MSS. Also, note that this mitigation is only effective if TCP probing is disabled (that is, the net.ipv4.tcp_mtu_probing sysctl is set to 0, which appears to be the default value for that sysctl).

Workaround #2: Disable SACK processing (/proc/sys/net/ipv4/tcp_sack set to 0).

The first workaround blocks packets with an overly small MSS, but doesn't guarantee the kernel won't panic (the advisory's term for it is mitigation).

The second workaround simply turns SACK off; performance drops more noticeably under packet loss with this one, but it appears to avoid the kernel panic outright...
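As a quick sketch (my own illustration, not code from the advisory), the sysctl side of the workarounds can be driven from Python on a Linux box with root privileges; the actual packet filters for workaround #1 are listed in the advisory itself:

```python
import pathlib

# Sketch of workaround #2: disable SACK processing system-wide.
# The paths are the standard Linux procfs locations named above.
TCP_SACK = pathlib.Path("/proc/sys/net/ipv4/tcp_sack")
TCP_MTU_PROBING = pathlib.Path("/proc/sys/net/ipv4/tcp_mtu_probing")

# Workaround #1 (the low-MSS filter) is only effective while
# net.ipv4.tcp_mtu_probing is 0, so it is worth checking first.
print("tcp_mtu_probing =", TCP_MTU_PROBING.read_text().strip())

# Apply workaround #2: /proc/sys/net/ipv4/tcp_sack set to 0.
TCP_SACK.write_text("0\n")
```

The same values can of course be set with sysctl(8) directly.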

Percona's Commentary on PostgreSQL with HugePages

Let me state my position up front: I remain skeptical of articles by Percona's Ibrar Ahmed, because the benchmark he ran earlier in "Benchmark PostgreSQL With Linux HugePages" produced odd results without a reasonable explanation; even after Percona's own CEO publicly asked about it in the comments, the article never offered one:

Hi,

A lot of interesting results here…

1) PgBench access distribution is very interesting. With database size growing by 20% from 80G to 96G we see performance drop of Several times which is very counter-intuitive

2) There is no difference between 2MB and 4K but huge difference between 1G and 2M even though I would expect at least some TLB miss reduction in the first transitioning. I would understand it in case transparent huge pages are Enabled… but not disabled

3) For 96GB why would throughput grow with number of clients for 1G but fall for 2M and 4KB.

This time I came across "Settling the Myth of Transparent HugePages for Databases", again discussing the impact of Linux HugePages on PostgreSQL, and again immediately spotted something odd...

First, the captions don't match the figures:

Figure 1.1 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) < shared_buffer (64GB)

Figure 1.2 PostgreSQL’s Benchmark, 10 minutes execution time where database workload (48GB) > shared_buffer (64GB)

One can infer that Figure 1.2 should say 112GB (since the corresponding chart is labeled 112GB), so let's treat it as a mislabel.

But that produces another odd result: with the smaller 48GB dataset, TPS is roughly 35K/33K/41K, yet the larger 112GB dataset reaches 39K/43K/41K~42K, i.e., it is faster? I can't think of a reason for the moment...

The tests overall cover pgbench and sysbench (also misspelled there as "sysbecnch", but never mind); pgbench was run in both 10-minute and 60-minute versions, yet sysbench only in a 10-minute version? What's the reason for that...

That said, there are still cases where enabling HugePages is faster (sysbench at 64 clients); going purely on intuition, I would still turn HugePages on (yeah, pure gut feeling). What I'm more curious about now is how long he'll last at Percona...
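For anyone who wants to sanity-check this kind of benchmark on their own hardware, the knobs involved are easy to inspect; a minimal sketch using the standard Linux procfs/sysfs paths (not taken from the Percona article):

```python
import pathlib

# Show the explicit HugePages counters from /proc/meminfo.
for line in pathlib.Path("/proc/meminfo").read_text().splitlines():
    if line.startswith(("HugePages_Total", "HugePages_Free", "Hugepagesize")):
        print(line)

# Transparent HugePages mode: the bracketed value is the active one,
# e.g. "always madvise [never]".
thp = pathlib.Path("/sys/kernel/mm/transparent_hugepage/enabled")
print("THP:", thp.read_text().strip())
```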

PostgreSQL's Fix for fsync()

Last time I wrote "PostgreSQL's fsync() Behavior Is a Headache..." about fsync() behaving differently from what developers expected in places, but then forgot to follow up on the progress...

I just saw Percona's "PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019" and realized the corresponding updates already shipped on 2/14, which the release notes also show:

By default, panic instead of retrying after fsync() failure, to avoid possible data corruption (Craig Ringer, Thomas Munro)

Some popular operating systems discard kernel data buffers when unable to write them out, reporting this as fsync() failure. If we reissue the fsync() request it will succeed, but in fact the data has been lost, so continuing risks database corruption. By raising a panic condition instead, we can replay from WAL, which may contain the only remaining copy of the data in such a situation. While this is surely ugly and inefficient, there are few alternatives, and fortunately the case happens very rarely.

A new server parameter data_sync_retry has been added to control this; if you are certain that your kernel does not discard dirty data buffers in such scenarios, you can set data_sync_retry to on to restore the old behavior.

The current behavior is that on fsync() failure, to avoid data corruption, PostgreSQL panics outright and replays from WAL, which also means HA mechanisms (if you have them in place) may end up being triggered for this reason...

There is also a new data_sync_retry parameter that lets a PostgreSQL administrator forcibly turn the panic behavior off and have PostgreSQL retry fsync() instead; presumably this will become useful once kernels change later...
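To make the hazard concrete, here is a minimal Python sketch of the panic-instead-of-retry policy (my illustration of the idea, not PostgreSQL's actual code):

```python
import os
import sys

def write_durably(fd: int, data: bytes) -> None:
    """Write and fsync; on fsync failure, give up instead of retrying.

    On some kernels a failed fsync() throws away the dirty buffers, so
    a second fsync() can "succeed" even though the data is gone. The
    safe reaction is to treat the first failure as fatal and recover by
    replaying a log (the WAL, in PostgreSQL's case).
    """
    os.write(fd, data)
    try:
        os.fsync(fd)
    except OSError as err:
        # Retrying here risks silent data loss; bail out instead.
        sys.exit(f"PANIC: fsync failed ({err}); recover from WAL")
```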

PostgreSQL's fsync() Behavior Is a Headache...

A talk at FOSDEM 2019 discusses how PostgreSQL, while ensuring the Durability part of ACID, ran into fsync() behaving differently from what was expected (mainly its behavior when fsync() fails): "PostgreSQL vs. fsync".

The Q&A-style interview at "PostgreSQL vs. fsync. How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it." gives a quick rundown of the short-term plan and the long-term thinking:

The short-term solution is ensuring that we detect fsync errors reliably at least on sufficiently recent kernels (since 4.13). On older kernels we can’t do much better, unfortunately.

The long-term solution is still being discussed in the community, but it’s hard to say how we could keep relying on buffered I/O in the future. So we may end up with direct I/O, but that’s a pretty significant change and is likely going to be a multi-year project.

MySQL, on the other hand, lives in a mostly O_DIRECT world, so it is affected far less...
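For comparison, direct I/O looks roughly like this on Linux (a sketch; real engines do this in C with carefully sized, aligned buffers, and the target filesystem must support O_DIRECT):

```python
import mmap
import os

# O_DIRECT bypasses the page cache, so there are no dirty kernel
# buffers for a later fsync() to lose. It requires block-aligned
# buffers, offsets, and lengths; mmap gives us a page-aligned buffer.
BLOCK = 4096
buf = mmap.mmap(-1, BLOCK)  # anonymous, page-aligned buffer
buf.write(b"x" * BLOCK)

# Note: the file must live on a filesystem that supports O_DIRECT
# (tmpfs, for example, does not).
fd = os.open("direct.dat", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
try:
    os.pwrite(fd, buf, 0)  # aligned write that goes straight to disk
finally:
    os.close(fd)
```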

Linux Kernel 4.20 Fixes a Truckload of Intel CPU Bugs, and Performance Falls off a Cliff...

Saw "Bisected: The Unfortunate Reason Linux 4.20 Is Running Slower", which tests 4.20.0 while it is still in RC; AMD performance isn't affected much, but Intel i9 performance drops hard.

The write-up shows measured regressions of 30%~50%:

This ranged from Rodinia scientific OpenMP tests taking 30% longer to Java-based DaCapo tests taking up to ~50% more time to complete to code compilation tests taking measurably longer to lower PostgreSQL database server performance to longer Blender3D rendering times.

Tests on other Intel CPUs also show that the i9 isn't the only one affected; low-end machines are too:

Those affected systems weren't high-end HEDT boxes but included a low-end Core i3 7100 as well as a Xeon E5 v3 and Core i7 systems.

Bisecting found which commit caused it:

That change is "STIBP" for cross-hyperthread Spectre mitigation on Intel processors. STIBP is the Single Thread Indirect Branch Predictors (STIBP) allows for preventing cross-hyperthread control of decisions that are made by indirect branch predictors.

But this is a security patch, so it's hard to just turn off... and ever since Meltdown and Spectre opened up a whole new world for security researchers, things will probably only get worse :o
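Whether a given kernel is applying STIBP (and the rest of the Spectre v2 mitigations) can be checked from sysfs; a minimal sketch:

```python
import pathlib

# The kernel reports its active Spectre v2 mitigations here; on
# kernels carrying the STIBP change, the string mentions "STIBP".
v2 = pathlib.Path("/sys/devices/system/cpu/vulnerabilities/spectre_v2")
print(v2.read_text().strip())
```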

Another Set of TCP BBR Test Results

TCP BBR is a TCP congestion control algorithm published by Google, a mechanism that improves TCP congestion handling purely on the server side. It has been included in the Linux Kernel since 4.9.

Spotify pushes large amounts of data to clients (audio files, for example), exactly the kind of traffic TCP BBR improves, and their real-world tests produced very good improvement numbers: "Smoother Streaming with BBR".

The data Spotify published doesn't mention the platform, so first a quick look at their audio quality tiers, i.e., the "Audio settings" page.

On Desktop it is 160kbps/320kbps Ogg (Standard/HQ). On the Web Player it is 128kbps/256kbps AAC (Standard/HQ).

Mobile is more complicated: on iOS it is 96kbps/160kbps/256kbps Ogg (Normal/High/Extreme), plus an Automatic auto-adjust setting. On Android it is 24kbps HE-AACv2 (Low), 96kbps/160kbps/320kbps Ogg (Normal/High/Very high), and Automatic.

Finally, Chromecast is 128kbps/256kbps (Standard/Premium).

In their tests, stutter (playback failing to keep up) dropped by 6%~10%, and download speed increased by 5%~7% (with more improvement, 10%~15%, for the slower cohorts):

Taking daily averages, stutter decreased 6-10% for the BBR group. Bandwidth increased by 10-15% for the slower download cohorts, and by 5-7% for the median. There was no difference in latency between groups.

The per-region breakdowns also show substantial improvement.

Also, during testing they happened to hit an outbound connectivity problem at their datacenter in Peru, and unexpectedly found that BBR still operated stably under those network conditions:

In Peru, the non-BBR group saw a 400-500% increase in stutter. In the BBR group, stutter only increased 30-50%.

In this scenario, the BBR group had 4x bandwidth for slower downloads (the 10th percentile), 2x higher median bandwidth, and 5x less stutter!

On Ubuntu 18.04 you can enable BBR directly; on Ubuntu 16.04 you can follow "Ubuntu 16.04 用 speedtest-cli 測試 TCP BBR 效能" to upgrade the kernel first and then enable BBR.
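Assuming a 4.9+ kernel with the tcp_bbr module available, enabling it is one sysctl away, and it can also be selected per socket; a minimal sketch:

```python
import pathlib
import socket

# System-wide default congestion control (requires root):
cc = pathlib.Path("/proc/sys/net/ipv4/tcp_congestion_control")
cc.write_text("bbr\n")

# Or per socket, via TCP_CONGESTION (Linux only, Python 3.6+):
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"bbr")
```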

WireGuard Has Been Merged into the Linux Kernel...

Saw on Twitter that WireGuard is being pulled into the Linux Kernel; once review is done it will be formally included... and Linus's assessment of it is quite positive:

A bit of digging shows the origin is "[PATCH v1 3/3] net: WireGuard secure network tunnel", and what's being quoted on Twitter is "Re: [GIT] Networking"...

DTrace Has Been Released Under the GPL...

Saw the news at "dtrace for linux; Oracle does the right thing" that DTrace has been released under the GPL (it was originally CDDL, which is incompatible with the GPLv2 used by the Linux Kernel).

This time it isn't just a release; there is also a commit integrating it into the Linux Kernel.

That said, I recall that after DTrace came out, the Linux community developed quite a few similar tools... it'll be interesting to see how this release interacts with them.

KPTI (Meltdown Mitigation)'s Pain Point for MyISAM

Some pretty startling numbers in MariaDB's "MyISAM and KPTI – Performance Implications From The Meltdown Fix". The post mentions a user report (the ticket is "[MDEV-15072] Massive performance impact after PTI fix - JIRA") saying KPTI (the Meltdown mitigation) has a huge performance impact on MyISAM:

Recently we had a report from a user who had seen a stunning 90% performance regression after upgrading his server to a Linux kernel with KPTI (kernel page-table isolation – a remedy for the Meltdown vulnerability).

They found the 90% came largely from running an old VMware version that doesn't pass the relevant CPU features through to the guest; newer versions should improve things considerably. Even so, the article still reproduced a roughly 40% performance loss on bare metal:

A big deal of those 90% was caused by running in an old version of VMware which doesn’t pass the PCID and INVPCID capabilities of the CPU to the guest. But I could reproduce a regression around 40% even on bare metal.

The rest of the post is a pitch for MariaDB's Aria Storage Engine, which isn't that important... but knowing that MyISAM hurts this much under KPTI matters, because we'll still be running into KPTI for the next five years or so, and presumably people will still be using MyISAM...
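Whether a guest actually sees the PCID/INVPCID features that soften KPTI's cost is easy to verify from /proc/cpuinfo; a minimal sketch:

```python
import pathlib

# pcid/invpcid make KPTI much cheaper; "pti" shows up in the flags
# when the kernel has page-table isolation enabled. An old VMware
# that hides pcid/invpcid from the guest is the worst case above.
cpuinfo = pathlib.Path("/proc/cpuinfo").read_text()
flags_line = next(l for l in cpuinfo.splitlines() if l.startswith("flags"))
flags = set(flags_line.split(":", 1)[1].split())
for feature in ("pcid", "invpcid", "pti"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```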