Spectre 的精華在於 CPU 支援 branch prediction 與 out-of-order execution,也就是 CPU 遇到 branch 時會學習怎麼跑,這個資訊提供給 out-of-order execution 就可以大幅提昇執行速度。可以參考以前在「CPU Branch Prediction 的成本...」提到的效率問題。
原理的部份可以看這段程式碼:
這類型程式碼常常出現在現代程式的各種安全檢查上:確認 x 沒問題後再實際將資料拉出來處理。而我們可以透過不斷的丟 x 值進去,讓 CPU 學到以為都是 TRUE,而在 CPU 學壞之後,突然丟進超出範圍的 x,產生 branch misprediction,但卻已經因為 out-of-order execution 而讓 CPU 執行過 y = ... 這段指令,進而導致 cache 的內容改變。
Suppose register R1 contains a secret value. If the speculatively executed memory read of array1[R1] is a cache hit, then nothing will go on the memory bus and the read from [R2] will initiate quickly. If the read of array1[R1] is a cache miss, then the second read may take longer, resulting in different timing for the victim thread.
所以相同道理,利用乘法器被佔用的 timing attack 也可以產生攻擊:
if (false but mispredicts as true)
multiply R1, R2
multiply R3, R4
In addition, of the three user-mode serializing instructions listed by Intel, only cpuid can be used in normal code, and it destroys many registers. The mfence and lfence (but not sfence) instructions also appear to work, with the added benefit that they do not destroy register contents. Their behavior with respect to speculative execution is not defined, however, so they may not work in all CPUs or system configurations.
However, we may manipulate its generation to control speculative execution while modifying the visible, on-stack value to direct how the branch is actually retired.
With GCC 6.3.0 on an i7-6700, my decoder is about 20% faster than the DFA decoder in the benchmark. With Clang 3.8.1 it’s just 1% faster.
而後來的更新則是大幅改善,在 Clang 上 DFA 版本比 branchless 的快:
Update: Björn pointed out that his site includes a faster variant of his DFA decoder. It is only 10% slower than the branchless decoder with GCC, and it’s 20% faster than the branchless decoder with Clang. So, in a sense, it’s still faster on average, even on a benchmark that favors a branchless decoder.
所以作者最後也有說這是個嘗試而已 XD:
It’s just a different approach. In practice I’d prefer Björn’s DFA decoder.
Merging eleven previously non-disclosed branches into master just before a release is not ideal but done so to minimize the security impact on existing users when the problems get known.
My plan is to merge them all into master and push around 48 hours before release, watch the autobuilds closesly, have a few extra coverity scans done and then fix up what's found before the release.