A design landmine in Python's asyncio.create_task()

Today's Hacker News Daily pointed me to "The Heisenbug lurking in your async code"; the HN discussion is over at "A Heisenbug lurking in async Python (textualize.io)".

By design, the task object returned by asyncio.create_task() is only weakly referenced by the event loop, not held with a normal (strong) reference, so if nothing else references it, Python's GC can genuinely collect the task out from under you:

Important: Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks. A task that isn’t referenced elsewhere may get garbage collected at any time, even before it’s done. For reliable “fire-and-forget” background tasks, gather them in a collection:
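The docs follow that with a pattern along these lines; here is a runnable sketch of it (some_coro and the demo scaffolding are mine):

import asyncio

async def some_coro(param: int) -> None:
    await asyncio.sleep(0.1)
    print(f"task {param} done")

background_tasks = set()

async def main() -> None:
    for i in range(10):
        task = asyncio.create_task(some_coro(param=i))
        # The set holds a strong reference, so the GC can't collect the task.
        background_tasks.add(task)
        # Drop the reference once the task finishes, so the set doesn't grow forever.
        task.add_done_callback(background_tasks.discard)
    await asyncio.sleep(1)  # give the background tasks time to finish (demo only)

asyncio.run(main())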

The paragraph just before that in the docs mentions that asyncio.TaskGroup.create_task() can be used instead, and it's the officially recommended fix, but it was only added in 3.11:

Note: asyncio.TaskGroup.create_task() is a newer alternative that allows for convenient waiting for a group of related tasks.
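The TaskGroup version sidesteps the problem because the group itself holds strong references to its tasks and the async with block waits for all of them before exiting; a minimal sketch (again with a stand-in some_coro):

import asyncio

async def some_coro(param: int) -> None:
    await asyncio.sleep(0.1)
    print(f"task {param} done")

async def main() -> None:
    # The TaskGroup keeps references to its tasks and implicitly
    # awaits them all when the block exits.
    async with asyncio.TaskGroup() as tg:
        for i in range(10):
            tg.create_task(some_coro(param=i))

asyncio.run(main())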

It's an easy thing to forget and then step on, since the functionally similar threading can be used with a fire-and-forget mindset, but this isn't threading XD

Managing lots of repositories with Scalar, built into Git 2.38

"Highlights from Git 2.38" introduces Scalar, a tool developed by Microsoft for managing large numbers of repositories as well as very large repositories, now shipped as part of Git 2.38.

Running scalar register for the first time does some one-time setup; when a repository is enrolled it automatically registers systemd timers:

Created symlink /home/gslin/.config/systemd/user/timers.target.wants/git-maintenance@hourly.timer → /home/gslin/.config/systemd/user/git-maintenance@.timer.
Created symlink /home/gslin/.config/systemd/user/timers.target.wants/git-maintenance@daily.timer → /home/gslin/.config/systemd/user/git-maintenance@.timer.
Created symlink /home/gslin/.config/systemd/user/timers.target.wants/git-maintenance@weekly.timer → /home/gslin/.config/systemd/user/git-maintenance@.timer.

So it looks like something runs every hour? From the documentation it appears to prefetch things in the background, and to run GC on a schedule?
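On a systemd machine you can check what actually got scheduled; these commands (mine, not from the article) list the timers and the repositories enrolled for maintenance:

systemctl --user list-timers 'git-maintenance@*'
git config --global --get-all maintenance.repo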

The next step is to throw every Git repository under the directory at it:

find . -maxdepth 1 -mindepth 1 -type d | xargs -n1 -I% bash -c "cd %; scalar register"

After that, scalar list shows which repositories are currently registered.

Uber's Go GC tuning

Saw "How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)" on Hacker News, on how Uber's engineers tune the Go GC; the Hacker News discussion "Large-scale, semi-automated Go GC tuning (uber.com)" covers some of it too.

Their initial approach was to keep adjusting the GOGC value dynamically:

Our initial approach was to have a ticker to run every second to monitor the heap metrics, and then adjust GOGC value accordingly.
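For context, GOGC defaults to 100, meaning the heap can roughly double over the live set before the next collection; it can be set statically through the environment, or adjusted at runtime with debug.SetGCPercent(), which is presumably what their ticker called. E.g. (the binary name is a placeholder):

GOGC=200 ./my-service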

But the overhead of that approach turned out to be too heavy:

The disadvantage of this approach is that the overhead starts to become considerable, because in order to read heap metrics Go needs to do a STW (ReadMemStats) and it is somewhat inaccurate, because we can have more than one garbage collection per second.

The approach they settled on uses SetFinalizer (and for some reason that piece of code is shown as an image...):

Luckily we were able to find a good alternative. Go has finalizers (SetFinalizer), which are functions that run when the object is going to be garbage collected. They are mainly useful for cleaning memory in C code or some other resources. We were able to employ a self-referencing finalizer that resets itself on every GC invocation. This allows us to reduce any CPU overhead.

Some people on Hacker News were also surprised that 30 services burn 70K cores, which is quite a bit more than you'd expect for Uber's services; and that's only the Go part, and only the portion they saved this time...

Also on Hacker News, someone mentioned that the Go team has been thinking about a soft memory limit design, which is worth a look too: "runtime/debug: soft memory limit #48409" and "Proposal: Soft memory limit".
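(That proposal has since shipped: Go 1.19 added runtime/debug.SetMemoryLimit() and the GOMEMLIMIT environment variable, so the two knobs can be combined, e.g. GOMEMLIMIT=4GiB GOGC=100 ./my-service, again with a placeholder binary name.)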

Java 17 (JDK 17), the new Java LTS release (and a look at the GCs)

Java 17 (JDK 17) is out, the new LTS release from Oracle proper, quoting the announcement on the jdk-dev mailing list: "Java 17 / JDK 17: General Availability". The Hacker News discussion is worth a look too: "Java 17 / JDK 17: General Availability (java.net)".

The previous LTS was Java 11, so naturally there's also a rundown of everything added since then: "JEPs in JDK 17 integrated since JDK 11".

For those of us who just run Java software rather than develop in it, my focus is on the JVM's GC performance and characteristics.

Starting from G1GC, the default since Java 11, there are some improvements: "JEP 345: NUMA-Aware Memory Allocation for G1" (Java 14) looks like it improves G1GC performance on machines with multiple physical CPUs, though it seems you have to add the -XX:+UseNUMA flag.
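That is, something like this (app.jar being a placeholder):

java -XX:+UseNUMA -jar app.jar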

Next is "JEP 346: Promptly Return Unused Committed Memory from G1" (Java 12), which runs GC during idle periods and hands memory back to the OS.

Then there are the two GCs that are new relative to version 11: ZGC and Shenandoah. Neither replaces G1GC as the default, but each has scenarios it's suited for.

ZGC lists two JEPs: "JEP 376: ZGC: Concurrent Thread-Stack Processing" and "JEP 377: ZGC: A Scalable Low-Latency Garbage Collector (Production)". The goal is to push GC pause times as low as possible, and the description on the wiki states a sub-millisecond target:

The ZGC garbage collector (GC) aims to make GC pauses and scalability issues in HotSpot a thing of the past.

Sub-millisecond max pause times

Shenandoah lists "JEP 379: Shenandoah: A Low-Pause-Time Garbage Collector (Production)", though the earlier "JEP 189: Shenandoah: A Low-Pause-Time Garbage Collector (Experimental)" goes into more detail; the goal is a GC that doesn't get in the way of the running program:

Add a new garbage collection (GC) algorithm named Shenandoah which reduces GC pause times by doing evacuation work concurrently with the running Java threads. Pause times with Shenandoah are independent of heap size, meaning you will have the same consistent pause times whether your heap is 200 MB or 200 GB.

Clearly both new GCs aim to reduce pause times; latency-sensitive applications should give them a try, with the expectation that overall throughput will be somewhat lower.
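Trying them out is a flag away (app.jar again a placeholder; note that Shenandoah's availability can depend on the vendor's build):

java -XX:+UseZGC -jar app.jar
java -XX:+UseShenandoahGC -jar app.jar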

Coming back to G1GC, someone ran benchmarks comparing G1GC between Java 11 and Java 17: "How much faster is Java 17?".

The G1GC improvements (the blue parts) still look considerable, though some cases do get slower. The article also covers Parallel GC, which I'll skip here; go read it yourself...

Once the various vendors' builds are out, I'll test how this affects Cassandra performance...

Facebook releases Hermes, a JS engine born for React Native

Facebook has released a JS engine optimized for React Native: "Hermes: An open source JavaScript engine optimized for mobile apps, starting with React Native".

The two more important parts mentioned are "no JIT" and the garbage collector strategy, both designed around the characteristics of mobile devices: avoiding the overhead a JIT introduces, and reducing memory usage.

The improvements in the official announcement are also mostly in these two areas.

What it doesn't say is how much CPU usage goes up; that part is only brushed past:

Notably, our primary metrics are relatively insensitive to the engine’s CPU usage when executing JavaScript code.

For Facebook that may be an acceptable amount, but for everyone else there's no way to know... if you're jumping in, weigh that risk yourself XD

How Instagram solved their Cassandra performance problems

Among attempts to fix Cassandra's performance problems, ScyllaDB is probably the best known, rewriting everything in C++ for a huge performance gain. The Instagram folks instead swapped out the underlying storage layer for RocksDB (this company really loves its own RocksDB...): "Open-sourcing a 10x reduction in Apache Cassandra tail latency".

The main reason is that they found Cassandra's data-handling components caused JVM GC problems, and that this was the main cause of Cassandra's poor performance:

Apache Cassandra is a distributed database with its own LSM tree-based storage engine written in Java. We found that the components in the storage engine, like memtable, compaction, read/write path, etc., created a lot of objects in the Java heap and generated a lot of overhead to JVM.

Tests after the swap show a big performance improvement, along with a dramatic drop in GC stalls:

In one of our production clusters, the P99 read latency dropped from 60ms to 20ms. We also observed that the GC stalls on that cluster dropped from 2.5% to 0.3%, which was a 10X reduction!

Comparing the two: ScyllaDB rewrote everything in C++ (without changing the data structures), which removes the JVM GC problem outright. Rocksandra instead profiled first and replaced the hot spot (here, the data-handling code, swapped directly for RocksDB), abstracting out some interfaces along the way... two different solutions, both of which got rid of the JVM GC problem.

The upcoming Xdebug 2.6 can observe PHP's GC behavior

Saw at "» Feature: Garbage Collection Statistics" that Xdebug 2.6 will be able to collect statistics on PHP's GC (garbage collection) behavior:

Xdebug's built-in garbage collection statistics profiler allows you to find out when the PHP internal garbage collector triggers, how many variables it was able to clean up, how long it took, and how much memory was actually freed.
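From what I can tell, enabling it is a couple of ini settings, something like this (setting names as I recall them from the 2.6 docs, so double-check before relying on them):

xdebug.gc_stats_enable = 1
xdebug.gc_stats_output_dir = /tmp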

That gives profiling a much more accurate picture...

Go 1.9's GC improvements

Update: after being reminded I took a closer look; this takes effect by default in 1.8 (with an option kept to switch back for debugging), and if nothing goes wrong, 1.9 removes the old mechanism entirely:

Assuming things go smoothly, we will remove stack re-scanning support when the tree opens for Go 1.9 development.

I'll leave the title as is... original post below.

Seen at "Sub-millisecond GC pauses". The Go team is working to reduce the impact of GC: "Proposal: Eliminate STW stack re-scanning".

The target is the biggest remaining source of GC pauses:

As of Go 1.7, the one remaining source of unbounded and potentially non-trivial stop-the-world (STW) time is stack re-scanning.

Then they bring the new fix to the fight; preliminary experiments show it can get down to under 50µs (== 0.05ms):

We propose to eliminate the need for stack re-scanning by switching to a hybrid write barrier that combines a Yuasa-style deletion write barrier [Yuasa '90] and a Dijkstra-style insertion write barrier [Dijkstra '78]. Preliminary experiments show that this can reduce worst-case STW time to under 50µs, and this approach may make it practical to eliminate STW mark termination altogether.
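The hybrid barrier itself is tiny; quoting the pseudocode from the proposal (from memory, so check the document itself):

writePointer(slot, ptr):
    shade(*slot)
    if current stack is grey:
        shade(ptr)
    *slot = ptr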

Progress can be followed at "runtime: eliminate stack rescanning · Issue #17503 · golang/go"; it has already landed on the master branch, and looks like it will ship with 1.9... although the worst-case time got revised upward XDDD

The high level summary is that this reduces worst-case STW time to about 100 µs and typical 95%ile STW time to 50 µs (assuming, of course, that the OS doesn't get in the way and that the system isn't otherwise overloaded).

Still, it should be a big performance improvement, especially for CPU-bound applications?

The commercial Zing JVM's GC improvements

While digging through "Stuff The Internet Says On Scalability For February 19th, 2016" I came across this article about how the Zing JVM dramatically reduced the latency problems they hit at C10M: "Fast C10M: MigratoryData running on Zing JVM achieves near 1 Gbps messaging to 10 million concurrent users with 15 millisecond consistent latency".

The average, the 99th percentile, and the maximum all dropped sharply:

In this post, we show that by simply replacing the JVM with Zing JVM out-of-the-box (without any tuning), and preserving the same C10M benchmark scenario and setup, we can reduce the average latency from 61 milliseconds to under 15 milliseconds. Moreover, and more importantly, the latency spikes can be significantly reduced from 585 milliseconds to 25 milliseconds for the 99th percentile latency and from 1700 milliseconds to 126 milliseconds for the maximum latency. Therefore, every single message can be delivered, even in the worst case, with almost no delay.

They also found that the 126ms maximum wasn't caused by GC either, but by the benchmark setup:

And so, the relatively high latency spikes we saw in the previous C10M benchmark were due to JVM’s Garbage Collection (GC). In the new benchmark, not only Zing JVM didn’t introduce high latency spikes, but based on analyzing the logs, it appears that GC effects no longer dominate latency behavior. The dramatically improved 126 ms max latency is not caused by GC but by other condition of the benchmark setup. Anyway, this max latency was so small for a web architecture that we did not spend time to determine at which level exactly it occurred.

Zing is priced at USD$8,000/year per server, with a virtual server version available as well (pricing on request); if you run into JVM GC problems, it looks like a problem you can throw money at:

Zing® is priced on a subscription basis per server. With per-server pricing, you don’t need to worry about core counts, memory size, or number of instances deployed per server. The annualized subscription price for Zing per physical server ranges from $8000 (for a single license) to under $2000 (for orders above 1000 servers). Higher volumes and longer subscription terms will reduce the per-server price for Zing. Pricing for virtual servers is also available upon request.