從簡單的 C 語言函式來看現代 Compiler 使用 SIMD 的威力

兩個禮拜前在 Hacker News Daily 上看到這篇很精彩的問題與分析,裡面展現出了現代 compiler 最佳化的能力,大量使用了 SIMD 來衝效能:「Why does this code execute more slowly after strength-reducing multiplications? (stackoverflow.com)」,原文在 Stack Overflow 上:「Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?」。

這篇會很長,除了本來 Stack Overflow 上的討論以外,我另外自己測 GCC 9.4.0 不加上 -O、加上 -O-O3,發現這次 Stack Overflow 給的範例剛剛好把這幾個常見的最佳化等級都練出不同結果,算是蠻厲害的題目。

作者一開始是寫了一個很簡單的版本 A,會透過 loop (對 i 進行) 計算 A*i^2 + B*i + C 的值,把結果放到 array 裡面:

double data[LEN];

void compute()
{
    const double A = 1.1, B = 2.2, C = 3.3;

    int i;
    for(i=0; i<LEN; i++) {
        data[i] = A*i*i + B*i + C;
    }
}

透過一些紙本公式計算可以知道,每次遞增的值雖然不是固定值,但也是有規律的:

所以可以改寫成一堆加號的版本 B:

void compute()
{
    const double A = 1.1, B = 2.2, C = 3.3;
    const double A2 = A+A;
    double Z = A+B;
    double Y = C;

    int i;
    for(i=0; i<LEN; i++) {
        data[i] = Y;
        Y += Z;
        Z += A2;
    }
}

理想上版本 A 在 loop 內用到三個乘法與兩個加法,而版本 B 只用到了三個加法,預期版本 B 應該會快不少,但實際上跑出來的結果剛好反過來:版本 B 慢了許多。

作者實際用 objdump 拉出來看,粗粗看下來也會發現版本 A 的指令多很多:

而版本 B 的指令簡單很多:

在討論下面已經有人給出解釋,主要的原因包括了兩個。

首先是現代 CPU 靠著暴力電路解決,乘法速度跟加法其實不像以前差那麼多,可以從 Instruction tables 這邊看到 MUL 類的指令速度雖然不能跟加法相比,但其實不算慢了,反倒是 DIV 整數除法類的指令比較痛。

另外一個原因,如果仔細看作者貼的 screenshot 分析會發現,在版本 A 裡面,一個 loop 其實做了四次 i 的運算 (add rax, 0x20),而版本 B 只做了一個 i 的運算 (add rax, 0x8),這邊 compiler 幫你 unroll 最佳化改用 SIMD 處理掉了。

在 Stack Overflow 的回答裡面,有人給了一段不錯的 code 示意,提到版本 A 其實先被展成像是這樣的程式碼:

int i;
for (i = 0; i < LEN; i += 4) {
    data[i+0] = A*(i+0)*(i+0) + B*(i+0) + C;
    data[i+1] = A*(i+1)*(i+1) + B*(i+1) + C;
    data[i+2] = A*(i+2)*(i+2) + B*(i+2) + C;
    data[i+3] = A*(i+3)*(i+3) + B*(i+3) + C;
}

然後被 SIMD 包起來處理掉了。

我把作者的 code (他有貼在 GitHub Gist 上) 拿下來編,用不同的 -O-O3 測試,然後去讀 assmebly 的部份也可以看到很多有趣的東西...

首先是在 -O3 的情況下 (也就是作者使用的參數),可以看到類似的結果:(我桌機的 CPU 是定速,沒有跑動態調整)

$ repeat 10 ./a
[-] Took: 248830 ns.
[-] Took: 249150 ns.
[-] Took: 248760 ns.
[-] Took: 248730 ns.
[-] Took: 248770 ns.
[-] Took: 248861 ns.
[-] Took: 248760 ns.
[-] Took: 253050 ns.
[-] Took: 248640 ns.
[-] Took: 249211 ns.
$ repeat 10 ./b
[-] Took: 686660 ns.
[-] Took: 696090 ns.
[-] Took: 696310 ns.
[-] Took: 694431 ns.
[-] Took: 691971 ns.
[-] Took: 697690 ns.
[-] Took: 693241 ns.
[-] Took: 692900 ns.
[-] Took: 654751 ns.
[-] Took: 679101 ns.

從版本 A 的 objdump -d -S -M intel a 可以看到作者 screenshot 內也有看的 unroll 與 SSE2 指令集:

13a0:       66 0f 6f c2             movdqa xmm0,xmm2
13a4:       48 83 c0 20             add    rax,0x20
13a8:       66 0f fe d6             paddd  xmm2,xmm6
13ac:       f3 0f e6 f8             cvtdq2pd xmm7,xmm0
13b0:       66 0f 28 cf             movapd xmm1,xmm7
13b4:       66 0f 70 c0 ee          pshufd xmm0,xmm0,0xee
13b9:       66 0f 59 cd             mulpd  xmm1,xmm5
13bd:       f3 0f e6 c0             cvtdq2pd xmm0,xmm0
13c1:       66 0f 59 cf             mulpd  xmm1,xmm7
13c5:       66 0f 59 fc             mulpd  xmm7,xmm4
13c9:       66 0f 58 cf             addpd  xmm1,xmm7
13cd:       66 0f 58 cb             addpd  xmm1,xmm3
13d1:       0f 29 48 e0             movaps XMMWORD PTR [rax-0x20],xmm1
13d5:       66 0f 28 c8             movapd xmm1,xmm0
13d9:       66 0f 59 cd             mulpd  xmm1,xmm5
13dd:       66 0f 59 c8             mulpd  xmm1,xmm0
13e1:       66 0f 59 c4             mulpd  xmm0,xmm4
13e5:       66 0f 58 c1             addpd  xmm0,xmm1
13e9:       66 0f 58 c3             addpd  xmm0,xmm3
13ed:       0f 29 40 f0             movaps XMMWORD PTR [rax-0x10],xmm0
13f1:       48 39 c2                cmp    rdx,rax
13f4:       75 aa                   jne    13a0 <compute+0x40>

而版本 B 的 objdump -d -S -M intel b 也符合作者提到的現象:

1340:       f2 0f 11 08             movsd  QWORD PTR [rax],xmm1
1344:       48 83 c0 08             add    rax,0x8
1348:       f2 0f 58 c8             addsd  xmm1,xmm0
134c:       f2 0f 58 c2             addsd  xmm0,xmm2
1350:       48 39 d0                cmp    rax,rdx
1353:       75 eb                   jne    1340 <compute+0x30>

但把 gcc 改成 -O 後,可以看到版本 A 的速度慢很多,但還是稍微比版本 B 快一些:

$ repeat 10 ./a
[-] Took: 571140 ns.
[-] Took: 570280 ns.
[-] Took: 571271 ns.
[-] Took: 573971 ns.
[-] Took: 571981 ns.
[-] Took: 569650 ns.
[-] Took: 566361 ns.
[-] Took: 571600 ns.
[-] Took: 571330 ns.
[-] Took: 571030 ns.
$ repeat 10 ./b
[-] Took: 697521 ns.
[-] Took: 696961 ns.
[-] Took: 696201 ns.
[-] Took: 694921 ns.
[-] Took: 696930 ns.
[-] Took: 695001 ns.
[-] Took: 701661 ns.
[-] Took: 698100 ns.
[-] Took: 702430 ns.
[-] Took: 702641 ns.

從 objdump 可以看到版本 A 的變化,退化成一次只處理一個,但把所有的數字都用 xmmN 存放計算:

11b1:       66 0f ef c9             pxor   xmm1,xmm1
11b5:       f2 0f 2a c8             cvtsi2sd xmm1,eax
11b9:       66 0f 28 c1             movapd xmm0,xmm1
11bd:       f2 0f 59 c4             mulsd  xmm0,xmm4
11c1:       f2 0f 59 c1             mulsd  xmm0,xmm1
11c5:       f2 0f 59 cb             mulsd  xmm1,xmm3
11c9:       f2 0f 58 c1             addsd  xmm0,xmm1
11cd:       f2 0f 58 c2             addsd  xmm0,xmm2
11d1:       f2 0f 11 04 c2          movsd  QWORD PTR [rdx+rax*8],xmm0
11d6:       48 83 c0 01             add    rax,0x1
11da:       48 3d 40 42 0f 00       cmp    rax,0xf4240
11e0:       75 cf                   jne    11b1 <compute+0x28>

而版本 B 在 -O 的情況下基本上是一樣的東西 (所以速度上差不多):

11b3:       f2 0f 11 08             movsd  QWORD PTR [rax],xmm1
11b7:       f2 0f 58 c8             addsd  xmm1,xmm0
11bb:       f2 0f 58 c2             addsd  xmm0,xmm2
11bf:       48 83 c0 08             add    rax,0x8
11c3:       48 39 d0                cmp    rax,rdx
11c6:       75 eb                   jne    11b3 <compute+0x2a>

再來是拔掉 -O,都不加就會超慢:

$ repeat 10 ./a
[-] Took: 1097091 ns.
[-] Took: 1092941 ns.
[-] Took: 1092501 ns.
[-] Took: 1091991 ns.
[-] Took: 1092441 ns.
[-] Took: 1093970 ns.
[-] Took: 1091341 ns.
[-] Took: 1093931 ns.
[-] Took: 1094111 ns.
[-] Took: 1092231 ns.
$ repeat 10 ./b
[-] Took: 2703282 ns.
[-] Took: 2705933 ns.
[-] Took: 2703582 ns.
[-] Took: 2702622 ns.
[-] Took: 2703043 ns.
[-] Took: 2702262 ns.
[-] Took: 2703352 ns.
[-] Took: 2703532 ns.
[-] Took: 2703112 ns.
[-] Took: 2702533 ns.

看 objdump 就可以發現幾乎都是對記憶體操作,沒有放到 register 裡面,這是版本 A:

11c1:       f2 0f 2a 45 e4          cvtsi2sd xmm0,DWORD PTR [rbp-0x1c]
11c6:       66 0f 28 c8             movapd xmm1,xmm0
11ca:       f2 0f 59 4d e8          mulsd  xmm1,QWORD PTR [rbp-0x18]
11cf:       f2 0f 2a 45 e4          cvtsi2sd xmm0,DWORD PTR [rbp-0x1c]
11d4:       f2 0f 59 c8             mulsd  xmm1,xmm0
11d8:       f2 0f 2a 45 e4          cvtsi2sd xmm0,DWORD PTR [rbp-0x1c]
11dd:       f2 0f 59 45 f0          mulsd  xmm0,QWORD PTR [rbp-0x10]
11e2:       f2 0f 58 c1             addsd  xmm0,xmm1
11e6:       f2 0f 58 45 f8          addsd  xmm0,QWORD PTR [rbp-0x8]
11eb:       8b 45 e4                mov    eax,DWORD PTR [rbp-0x1c]
11ee:       48 98                   cdqe   
11f0:       48 8d 14 c5 00 00 00    lea    rdx,[rax*8+0x0]
11f7:       00 
11f8:       48 8d 05 41 2e 00 00    lea    rax,[rip+0x2e41]
11ff:       f2 0f 11 04 02          movsd  QWORD PTR [rdx+rax*1],xmm0
1204:       83 45 e4 01             add    DWORD PTR [rbp-0x1c],0x1
1208:       81 7d e4 3f 42 0f 00    cmp    DWORD PTR [rbp-0x1c],0xf423f
120f:       7e b0                   jle    11c1 <compute+0x38>

這是版本 B:

11e8:       8b 45 cc                mov    eax,DWORD PTR [rbp-0x34]
11eb:       48 98                   cdqe   
11ed:       48 8d 14 c5 00 00 00    lea    rdx,[rax*8+0x0]
11f4:       00 
11f5:       48 8d 05 44 2e 00 00    lea    rax,[rip+0x2e44]
11fc:       f2 0f 10 45 d8          movsd  xmm0,QWORD PTR [rbp-0x28]
1201:       f2 0f 11 04 02          movsd  QWORD PTR [rdx+rax*1],xmm0
1206:       f2 0f 10 45 d8          movsd  xmm0,QWORD PTR [rbp-0x28]
120b:       f2 0f 58 45 d0          addsd  xmm0,QWORD PTR [rbp-0x30]
1210:       f2 0f 11 45 d8          movsd  QWORD PTR [rbp-0x28],xmm0
1215:       f2 0f 10 45 d0          movsd  xmm0,QWORD PTR [rbp-0x30]
121a:       f2 0f 58 45 f8          addsd  xmm0,QWORD PTR [rbp-0x8]
121f:       f2 0f 11 45 d0          movsd  QWORD PTR [rbp-0x30],xmm0
1224:       83 45 cc 01             add    DWORD PTR [rbp-0x34],0x1
1228:       81 7d cc 3f 42 0f 00    cmp    DWORD PTR [rbp-0x34],0xf423f
122f:       7e b7                   jle    11e8 <compute+0x5f>

寫到這邊差不多了,作者拿的這個範例算是很有趣的例子,尤其是現代 compiler 幫我們做了超多事情後,很多自己以為的 optimization 其實未必比較好,還是要有個 profiling review 才準...

居然在安全性漏洞的 PoC 上面看到拿 Bad Apple!! 當作範例

人在日本的資安專家 Hector Martin 找到了 Apple M1 的安全漏洞,可以不用透過 macOS Big Sur 提供的界面,直接透過 M1 的漏洞跨使用者權限傳輸資料,這可以用在突破 sandbox 的限制。而也如同目前的流行,他取了一個好記的名字:「M1RACLES: M1ssing Register Access Controls Leak EL0 State」,對應的 CVECVE-2021-30747

先講比較特別的點,PoC 的影片放在 YouTube 上,作者拿 Bad Apple!! 當作示範,這很明顯是個雙關的點:

這應該是當年的影繪版本,看了好懷念啊... 當年看到的時候有種「浪費才能」的感覺,但不得不說是個經典。

Hacker News 上有討論可以翻翻:「M1racles: An Apple M1 covert channel vulnerability (m1racles.com)」。

依照作者的說明,Apple A14 因為架構類似,也有類似的問題,不過作者沒有 iPhone,沒辦法實際測試:

Are other Apple CPUs affected?

Maybe, but I don't have an iPhone or a DTK to test it. Feel free to report back if you try it. The A14 has been confirmed as also affected, which is expected, as it is a close relative of the M1.

另外作者覺得這個安全漏洞在 macOS 上還好,主要是你系統都已經被打穿可以操控 s3_5_c15_c10_1 register 了,應該會有更好的方式可以用:

So you're telling me I shouldn't worry?

Yes.

What, really?

Really, nobody's going to actually find a nefarious use for this flaw in practical circumstances. Besides, there are already a million side channels you can use for cooperative cross-process communication (e.g. cache stuff), on every system. Covert channels can't leak data from uncooperative apps or systems.

Actually, that one's worth repeating: Covert channels are completely useless unless your system is already compromised.

比較明顯的問題應該是 iOS 這邊的 privacy issue,不過 iOS 上的 app store 有基本的保護機制:(不過想到作者可以故意寫成 RCE 漏洞...)

What about iOS?

iOS is affected, like all other OSes. There are unique privacy implications to this vulnerability on iOS, as it could be used to bypass some of its stricter privacy protections. For example, keyboard apps are not allowed to access the internet, for privacy reasons. A malicious keyboard app could use this vulnerability to send text that the user types to another malicious app, which could then send it to the internet.

However, since iOS apps distributed through the App Store are not allowed to build code at runtime (JIT), Apple can automatically scan them at submission time and reliably detect any attempts to exploit this vulnerability using static analysis (which they already use). We do not have further information on whether Apple is planning to deploy these checks (or whether they have already done so), but they are aware of the potential issue and it would be reasonable to expect they will. It is even possible that the existing automated analysis already rejects any attempts to use system registers directly.