Intel 對於在 E-cores 上面可以跑 AVX-512 指令集的計畫:AVX10.2

看到「Intel AVX10.2 ISA to enable AVX-512 capabilities on E-cores」這篇提到了 Intel 的技術文件「The Converged Vector ISA: Intel® Advanced Vector Extensions 10」,裡面提到了 Intel 後續對 AVX-512 的計畫。

主要是這張,可以看到在 AVX10.2 的規劃中會支援 E-cores:

不過目前還要等,這邊只放了一個 future 的說明:

目前的傳言是 2024 或 2025 會有 AVX10.1 在 Xeon 上出來:

Intel says that version 1 of the AVX10 vector ISA (AVX10.1) will first be implemented on Intel Xeon “Granite Rapids” processors that, according to some media reports, are expected to launch by 2024 or 2025, so it will likely take a long while before AVX10.2 is implemented on processors with E-cores.

但 AVX10.1 還沒有在 E-cores 上面執行 AVX512 的能力,所以 AVX10.2 應該是更後面...

Intel 用 AVX-512 加速 NumPy 的排序演算法被整合進主線了

IntelAVX-512 加速 NumPy 排序的實做被整合進主線了:「「Intel Publishes Blazing Fast AVX-512 Sorting Library, Numpy Switching To It For 10~17x Faster Sorts」」。

GitHub 的 PR 在「ENH: Vectorize quicksort for 16-bit and 64-bit dtype using AVX512 #22315」這邊,可以看到相關的留言:

This patch adds AVX512 based 64-bit on AVX512-SKX and 16-bit sorting on AVX512-ICL. All the AVX512 sorting code has been reformatted as a separate header files and put in a separate folder. The AVX512 64-bit sorting is nearly 10x faster and AVX512 16-bit sorting is nearly 16x faster when compared to std::sort. Still working on running NumPy benchmarks to get exact benchmark numbers

16-bit int sped up by 17x and float64 by nearly 10x for random arrays. Benchmarked on a 11th Gen Tigerlake i7-1165G7.

有點「有趣」的情況是,AVX-512 在新的 Intel 消費級 CPU 被拔掉了,只有伺服器工作站的 CPU 有保留。而 AMDZen 4 則是跳下去支援 AVX-512...

另外在「Intel Publishes Fast AVX-512 Sorting Library, 10~17x Faster Sorts in NumPy (phoronix.com)」這邊當然也有人提到,如果用更廣泛的 AVX2 (寬度是 256bits) 加速的話應該也會有很大的進步才對?幾乎這十年的 CPU 都有 AVX2 了... 不過看起來沒有什麼深入的討論?

這次 OpenSSL 的兩個 CVE

難得在 Hacker News 首頁上看到 OpenSSLCVE:「OpenSSL Security Advisory [5 July 2022]」,相關的討論在「OpenSSL Security Advisory (openssl.org)」。

第一個 CVE 是 RCE 等級,但觸發條件有點多:

首先是 RSA 2048bits,這個條件應該算容易發生的。

第二個是,因為這個安全問題是因為 OpenSSL 3.0.4 才引入的程式碼,而 OpenSSL 3.0.4 是 2022/06/21 發表的,未必有很多人有升級。

第三個是,因為這次出包的段落是用到了 AVX-512 指令集,一定要 Intel 或是 Centaur 的 CPU,後面這家公司前身就是威盛 (VIA) 的一員,去年賣給了 Intel (然後發現連官網用的 domain 都沒續約...)。

AMD 雖然在 Zen 4 架構上支援 AVX-512,但還沒推出產品,所以直接閃避 XD

另外第三個還有額外的限制,因為這次用到的是 IFMA 指令集,所以也不是所有有支援 AVX-512 的 CPU 都會中獎:

只看 Intel 的部份,第一個支援 IFMA 的是 2018 年推出的 Cannon Lake,這個架構只有一顆行動版的 Intel® Core™ i3-8121U Processor

真正大量支援 IFMA 的是 2019 後的 Intel CPU 了,但到了去年推出的 Alder Lake 因為 E-core 不支援 AVX-512 的關係 (但 P-core 支援),預設又關掉了。

所以如果問這個 bug 嚴不嚴重,當然是很嚴重,但影響範圍就有點微妙了。

接下來講第二個 CVE,是 AES OCB 的實做問題,比較有趣的地方是 Hacker News 上的討論引出了 Mosh 的作者跳出來說明,他居然提到他們在二月的時候試著換到 OpenSSL 的 AES OCB 時有測出這個 bug,被 test case 擋下來了:

Mosh uses AES-OCB (and has since 2011), and we found this bug when we tried to switch over to the OpenSSL implementation (away from our own ocb.cc taken from the original authors) and Launchpad ran it through our CI testsuite as part of the Mosh dev PPA build for i686 Ubuntu. (It wasn't caught by GitHub Actions because it only happens on 32-bit x86.) https://github.com/mobile-shell/mosh/issues/1174 for more.

So I would say (a) OCB is widely used, at least by the ~million Mosh users on various platforms, and (b) this episode somewhat reinforces my (perhaps overweight already) paranoia about depending on other people's code or the blast radius of even well-meaning pull requests. (We really wanted to switch over to the OpenSSL implementation rather than shipping our own, in part because ours was depending on some OpenSSL AES primitives that OpenSSL recently deprecated for external users.)

Maybe one lesson here is that many people believe in the benefits of unit tests for their own code, but we're not as thorough or experienced in writing acceptance tests for our dependencies.

Mosh got lucky this time that we had pretty good tests that exercised the library enough to find this bug, and we run them as part of the package build, but it's not that farfetched to imagine that we might have users on a platform that we don't build a package for (and therefore don't run our testsuite on).

這有點有趣 XDDD

Linus 狂幹 Intel 的 AVX-512

這幾天蠻熱鬧的消息,Linus 幹翻 Intel 丟出來的 AVX-512:「Alder Lake and AVX-512」。

在維基百科的「Advanced Vector Extensions」這邊有提到,因為 AVX-512 執行時會消耗產生更多的熱量,所以得壓低 Turbo Boost 執行:

Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:

  • L0 (100%): The normal turbo boost limit.
  • L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
  • L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

本來 AVX 與 AVX-2 只會在某些重量級的指令時會壓 15%,現在在 AVX-512 則是變成常態,而且有些會降到 40%,對於同時在跑的應用會受到很大的影響,所以 Linus 也直接放話要用他的權限擋這件事情 (我把動詞讀錯了):

I want my power limits to be reached with regular integer code, not with some AVX512 power virus that takes away top frequency (because people ended up using it for memcpy!) and takes away cores (because those useless garbage units take up space).

在後面的討論串「Alder Lake and AVX-512」這邊 Linus 有提到更細,像是他對於 MMX/SSE/AVX/AVX2 的想法,以及為什麼他這麼厭惡 AVX-512。

AMD 的繼續看戲 XDDD

超快速的 Base64 encoding/decoding 實做

看到「Base64 encoding and decoding at almost the speed of a memory copy」這個,可以超級快速編解碼 Base64 的資料。

實做上是透過 IntelAVX-512 加速,在資料夠大的情況下 (超過 L1 cache 的大小),可以達到接近字串複製的速度 (這邊提到的 memcpy()):

We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs.

不過這樣 AMD 暫時要哭哭...

最新的 Firefox 56 對 AES-GCM 效能的改善

昨天釋出的 Firefox 56 對於 AES-GCM 在老電腦上改善了不少效能:「Improving AES-GCM Performance」。

首先是 Firefox 自己的數據分析,可以看到 AES-GCM 佔目前加密連線裡的大宗,再來是 AES-CBC:

先以 Linux 64bits 環境的數據來看,Firefox 56 的 NSS 3.32 大幅改善了老電腦的效能 (不支援 AES-NI 硬體加解密的 CPU,甚至是不支援 PCLMUL 的 CPU,以及不支援 AVX 的 CPU):

在 Linux 32bits 環境上則是連預設值大幅改善,不過用的人應該少很多了:

Windows 下則是因為 64bits 或是 32bits 都有足夠的使用者,所以平常就花了不少力氣。但也可以看出對於老電腦的速度提升:

Mac (64bits only) 算是這次比較大的提升,連新電腦的預設值都大幅變快:

加上之後陸續的改善 (尤其是下一版 Firefox 57 的 Project Quantum),這幾版應該會拉出不少效能...

Google 發表了三個 Hash 演算法的實作

Google 發表了三個 Hash 演算法的實作:「New algorithms may lower the cost of secure computing」。

第一個是 SipHash 的加速實作,透過 AVX-2 指令集加速,看維基百科的資料,2011 後的 Intel/AMD CPU 似乎都有提供這組指令集:

Our first hash function produces the same output as SipHash, but 1.5 times as quickly thanks to AVX-2 instructions.

第二個是 SipHash 的改良版,但輸出不同 (所以不是 SipHash),但速度比 SipHash 更快:

The second improvement uses j-lanes tree hashing to process multiple inputs in parallel, which is 3 times as fast. This technique is known to be secure, but produces different output than the original SipHash and is slightly slower for short inputs.

第三個則是新的 Hash,速度比前兩者又更快了,但還需要有更多人分析才能確認安全性:

HighwayHash is based on a new way of mixing inputs with just a few AVX-2 multiply and permute instructions. We are hopeful that the result is a cryptographically strong pseudorandom function, but new cryptanalysis methods might be needed for analyzing this promising family of hash functions. HighwayHash is significantly faster than SipHash for all measured input sizes, with about 7 times higher throughput at 1 KiB.

三者的程式碼都可以在 GitHub 上的「google/highwayhash」找到,看 LICENSE 檔案是 Apache License 2.0