Golang 的排序演算法將換成 pdqsort,LLVM libc++ 換成 BlockQuicksort

Hacker News 首頁上看到的消息,Golang 將會把 sort.Sort() 換成 pdqsort (Pattern-defeating Quicksort):「Go will use pdqsort in next release (github.com/golang)」,對應的 commit 則是在「sort: use pdqsort」這邊可以看到。

然後另外是「Changing std:sort at Google’s scale and beyond (danlark.org)」這邊提到了,LLVMlibc++std::sortQuicksort 換成 BlockQuicksort。另外在文章裡面有提到一段 Knuth 老大在 TAOCP 裡講 sorting algorithm 沒有霸主的情況:

It would be nice if only one or two of the sorting methods would dominate all of the others, regardless of application or the computer being used. But in fact, each method has its own peculiar virtues. […] Thus we find that nearly all of the algorithms deserve to be remembered, since there are some applications in which they turn out to be best.

先回到 pdqsort 的部份,pdqsort 作者的 GitHub 上 (orlp/pdqsort) 可以看到他對 pdqsort 的說明:

Pattern-defeating quicksort (pdqsort) is a novel sorting algorithm that combines the fast average case of randomized quicksort with the fast worst case of heapsort, while achieving linear time on inputs with certain patterns.

看名字也可以知道 pdqsort 是從 Quicksort 改良的版本,而依照 Golang 的 commit 上的測試,與 Quicksort 相比,少數情況下會慢一點點,大多數的情況下會快一些,而在特殊情境下會讓 worst case 下降。

Golang 選擇把 unstable 的 Quicksort 換成 pdqsort,LLVM 則是選擇把 Quicksort 換成 BlockQuicksort,這邊看起來有些分歧...

反倒是各個程式語言對於 stable 的 Mergesort 陸陸續續都換成了 Timsort,看起來比較像是有個共識...

sscanf() 與 strlen() 的故事繼續發展

昨天在「GTA 的啟動讀取效能問題」這邊提到了 sscanf()strlen() 的問題,剛剛在 Hacker News Daily 上又看到一篇「It Can Happen to You」,在講他自己的專案也中獎。

他提到了一個解法,用 strtof() 取代 sscanf() 讀數字,結果大幅降低了 parsing 的時間:

Replacing the sscanf call with strtof improved startup by nearly a factor of 10: from 1.8 seconds to 199 milliseconds.

文章的最後面題到了不少目前正在進行中的討論與 patch。

首先是 FreeBSD 上的 patch 已經在測試:「address a performance problem w/ partial sscanf on long strings...」,裡面可以看到有很小心的在研究會不會造成 performance regression。

然後是 glibc 這邊,在 2014 年就有被開了一張票提出來:「Bug 17577 - sscanf extremely slow on large strings」,不過下面只是多了幾個 comment,目前沒有任何進度。

然後是 cppreference.com 上的「std::scanf, std::fscanf, std::sscanf」頁面則是加注了複雜度的問題:


Not guaranteed. Notably, some implementations of sscanf are O(N), where N = std::strlen(buffer) [1]. For performant string parsing, see std::from_chars.


GTA 的啟動讀取效能問題


起因是 GTA Online 的遊戲開啟速度很慢,而有人一路 reverse engineering 找出問題並且解決:「How I cut GTA Online loading times by 70%」,對應的 Hacker News 討論有提到其他有趣的事情也可以看看:「How I cut GTA Online loading times by 70% (nee.lv)」。

作者的電腦不算太差,但光開啟 GTA Online 就需要六分鐘,網路上甚至有辦投票蒐集大家的等待時間,發現也有很多人反應類似的問題:

接下來就開始 reverse engineering 了,先觀察各種狀態後發現是卡在 CPU,而不是網路或 Disk I/O,然後就拿出 Luke Stackwalker 這個工具 profiling,不過因為沒有 debug symbol 幫忙 group,所以只能人工判斷後,可以看到兩個問題:

第一個問題發現效能是卡在 strlen(),而 call stack 可以看出來是從 sscanf() 一路打進去的:

反追發現是在處理 10MB 的 JSON 檔造成的,裡面 sscanf() 因為拉出 strlen(),於是就造成把整個 10MB 的 JSON 掃過很多次 (一開始是 10MB,掃到後面會愈來愈少,平均下來應該是 5MB):

第二個問題產生的時間會在第一個問題跑完後,另外看問題的性質,應該跟第一個 JSON 處理有關,他會把 JSON 處理過的資料丟進 array,每個 entry 長這樣:

struct {
    uint64_t *hash;
    item_t   *item;
} entry;

丟進 array 是 OK 的,但問題在於他需要判斷 entry 是否重複,卻沒有用 hash 或是 tree 的結構,而這邊大約有 63k 筆資料,用 array 實做就產生了 O(n^2) 的演算法:

But before it’s stored? It checks the entire array, one by one, comparing the hash of the item to see if it’s in the list or not. With ~63k entries that’s (n^2+n)/2 = (63000^2+63000)/2 = 1984531500 checks if my math is right. Most of them useless. You have unique hashes why not use a hash map.

作者在 PoC 的章節裡面描述他怎麼解這兩個問題。

第一個問題比較好的解法是修正 JSON Parser,但這太複雜,所以他用 workaround 解:把 strlen() 包起來,針對長字串加上一層 cache:

  • hook strlen
  • wait for a long string
  • “cache” the start and length of it
  • if it’s called again within the string’s range, return cached value


And as for the hash-array problem, it’s more straightforward - just skip the duplicate checks entirely and insert the items directly since we know the values are unique.

整個開啟的速度從六分鐘降到一分五十秒,還是偏慢,但算是大幅緩解的 GTA Online 啟動速度的問題了。

不過故事到這邊還沒結束,有人一路去挖,發現其實 sscanf() 的效能地雷已經不是第一次了:YAML 的 Parser 也中過一樣的問題:「Parsing can become accidentally quadratic because of sscanf」,這篇也一樣上了 Hacker News:「Parsing can become accidentally quadratic because of sscanf (github.com/biojppm)」。

然後這又帶出了六年前在 StackOverflow 上就有人問過這個問題:「Why is glibc's sscanf vastly slower than fscanf on Linux?」。


JdeBP 3 days ago

I found this while making a collection of what C implementation does what at https://news.ycombinator.com/item?id=26298300.

There are two basic implementation strategies. The BSD (FreeBSD and OpenBSD and more than likely NetBSD too), Microsoft, GNU, and MUSL C libraries use one, and suffer from this; whereas the OpenWatcom, P.J. Plauger, Tru64 Unix, and my standard C libraries use another, and do not.

The 2002 report in the comp.lang.c Usenet newsgroup (listed in that discussion) is the earliest that I've found so far.

後續的更新動作可以再追一下進度 (包括 GTA Online 與各家的 libc)。

CVE-2015-7547:getaddrinfo() 的 RCE (Remote Code Execution) 慘案

Google 寫了一篇關於 CVE-2015-7547 的安全性問題:「CVE-2015-7547: glibc getaddrinfo stack-based buffer overflow」。

Google 的工程師在找 OpenSSH 連到某台特定主機就會 segfault 的通靈過程中,發現問題不在 OpenSSH,而是在更底層的 glibc 導致 segfault:

Recently a Google engineer noticed that their SSH client segfaulted every time they tried to connect to a specific host. That engineer filed a ticket to investigate the behavior and after an intense investigation we discovered the issue lay in glibc and not in SSH as we were expecting.

由於等級到了 glibc 這種每台 Linux 都有裝的情況,在不經意的情況下發生 segfault,表示在刻意攻擊的情況下可能會很糟糕,所以 Google 投入了人力研究,想知道這個漏洞到底可以做到什麼程度:

Thanks to this engineer’s keen observation, we were able determine that the issue could result in remote code execution. We immediately began an in-depth analysis of the issue to determine whether it could be exploited, and possible fixes. We saw this as a challenge, and after some intense hacking sessions, we were able to craft a full working exploit!

在研究過程中 Google 發現 Red Hat 的人也在研究同樣的問題:「(CVE-2015-7547) - In send_dg, the recvfrom function is NOT always using the buffer size of a newly created buffer (CVE-2015-7547)」:

In the course of our investigation, and to our surprise, we learned that the glibc maintainers had previously been alerted of the issue via their bug tracker in July, 2015. (bug). We couldn't immediately tell whether the bug fix was underway, so we worked hard to make sure we understood the issue and then reached out to the glibc maintainers. To our delight, Florian Weimer and Carlos O’Donell of Red Hat had also been studying the bug’s impact, albeit completely independently! Due to the sensitive nature of the issue, the investigation, patch creation, and regression tests performed primarily by Florian and Carlos had continued “off-bug.”

攻擊本身需要繞過反制機制 (像是 ASLR),但仍然是可行的,Google 的人已經成功寫出 exploit code:

Remote code execution is possible, but not straightforward. It requires bypassing the security mitigations present on the system, such as ASLR. We will not release our exploit code, but a non-weaponized Proof of Concept has been made available simultaneously with this blog post.

技術細節在 Google 的文章裡也有提到,buffer 大小固定為 2048 bytes,但取得時有可能超過 2048 bytes,於是造成 buffer overflow:

glibc reserves 2048 bytes in the stack through alloca() for the DNS answer at _nss_dns_gethostbyname4_r() for hosting responses to a DNS query.

Later on, at send_dg() and send_vc(), if the response is larger than 2048 bytes, a new buffer is allocated from the heap and all the information (buffer pointer, new buffer size and response size) is updated.

另外 glibc 官方的 mailing list 上也有說明:「[PATCH] CVE-2015-7547 --- glibc getaddrinfo() stack-based buffer overflow」。