intel – Page 4 – Gea-Suan Lin's BLOG

Intel's RDRAND blows up...

Saw this on the lovely wens's Facebook: because of a security vulnerability (CrossTalk/SRBDS), Intel's newly released fix leaves RDRAND with only 3% of its original performance:

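Going by Wikipedia, this is probably because the instruction has compliance requirements, so the vulnerability had to be fixed cleanly from a security standpoint. That meant brute-forcing it with a heavy lock, which is why performance dropped this much: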

The random number generator is compliant with security and cryptographic standards such as NIST SP 800-90A, FIPS 140-2, and ANSI X9.82.

That said, this instruction isn't used very often, so the impact on ordinary users should be limited:

As explained in the earlier article, mitigating CrossTalk involves locking the entire memory bus before updating the staging buffer and unlocking it after the contents have been cleared. This locking and serialization now involved for those instructions is very brutal on the performance, but thankfully most real-world workloads shouldn't be making too much use of these instructions.
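As a rough way to see whether a particular Linux machine is in scope, a sketch like the following checks whether the CPU advertises rdrand/rdseed in /proc/cpuinfo and what the kernel reports for SRBDS. It assumes a kernel that carries the SRBDS patches (older kernels will not have the srbds file), and the helper names are made up for illustration.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def srbds_status(path="/sys/devices/system/cpu/vulnerabilities/srbds"):
    """Return the kernel's SRBDS status string, or None if the file is absent."""
    p = Path(path)
    return p.read_text().strip() if p.is_file() else None


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    flags = cpu_flags(cpuinfo.read_text()) if cpuinfo.is_file() else set()
    print("rdrand:", "rdrand" in flags, "rdseed:", "rdseed" in flags)
    print("srbds:", srbds_status())
```

A None result only means the kernel does not report on SRBDS; it says nothing about whether the CPU is affected.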

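Also, this vulnerability was reported to Intel as early as September 2018, but the fix took more than a year and a half to ship. This is one of the potential drawbacks people raised back when Bug Bounty programs were first being proposed, and it is fairly visible this time: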

We disclosed an initial PoC (Proof-Of-Concept) showing the leakage of staging buffer content in September 2018, followed by a PoC implementing cross-core RDRAND/RDSEED leakage in July 2019. Following our reports, Intel acknowledged the vulnerabilities, rewarded CrossTalk with the Intel Bug Bounty (Side Channel) Program, and attributed the disclosure to our team with no other independent finders. Intel also requested an embargo until May 2020 (later extended), due to the difficulty of implementing a fix for the cross-core vulnerabilities identified in this paper.

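Back to the bug itself: it's mostly problems in Intel's architecture that let everyone have so much fun attacking it. Intel's architecture is still a favorite playground for security researchers... (because it has so many users)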

AMD Ryzen Threadripper 3990X performance on Windows

John Carmack noticed that on the AMD Ryzen Threadripper 3990X, Windows' group limit causes a performance problem:

AMD 3990 CPU scaling tests: Because of the Windows group limit of 64 CPUs, just firing up a lot of C++ std::threads didn't work great:

128 t = 67 s
64 t = 63 s
32 t = 84 s
16 t = 160 s
8 t = 312 s

32 to 64 threads wasn't a big boost, and 64 to 128 was slower. However!

— John Carmack (@ID_AA_Carmack) April 11, 2020

But this can be worked around by spreading the threads across two groups, which speeds things up:

Setting the group explicitly let it scale all the way up:

128 t = 38 s
64 t = 48 s
32 t = 84 s
16 t = 160 s
8 t = 312 s

Notably, because each group gets 32 hyperthreaded cores, 64 threads across 2 groups on an unloaded system is much faster because they are all alone on a core pic.twitter.com/Ip4OZsTXah

— John Carmack (@ID_AA_Carmack) April 11, 2020
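To make the two sets of timings easier to compare, here is a small sketch that turns them into speedup and parallel efficiency relative to the 8-thread run. The numbers are copied from the tweets; using 8 threads as the baseline is my own choice, since no single-thread time was posted.

```python
# Wall-clock seconds from the two tweets, keyed by thread count.
default_group = {8: 312, 16: 160, 32: 84, 64: 63, 128: 67}
explicit_group = {8: 312, 16: 160, 32: 84, 64: 48, 128: 38}


def speedup_table(timings, baseline_threads=8):
    """Return {threads: (speedup, efficiency)} relative to the baseline row."""
    base = timings[baseline_threads]
    table = {}
    for threads, seconds in sorted(timings.items()):
        speedup = base / seconds
        ideal = threads / baseline_threads  # perfect linear scaling
        table[threads] = (speedup, speedup / ideal)
    return table


if __name__ == "__main__":
    for label, timings in (("default", default_group), ("explicit", explicit_group)):
        for threads, (speedup, eff) in speedup_table(timings).items():
            print(f"{label:8s} {threads:3d} t  speedup {speedup:5.2f}x  efficiency {eff:4.0%}")
```

With the default setup, going from 64 to 128 threads lowers the speedup, while with explicit groups it keeps rising, though well short of linear.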

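While I was at it, I looked at the CPU Benchmark site's scores for high-end CPUs, "PassMark - CPU Mark High End CPUs". AMD has been looking really tasty lately: the 3950X (16C/32T, 105W) beats Intel's current top scorer, the W-3275M (28C/56T, 205W), and then there's that price gap: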

Intel keeps squeezing the 14nm toothpaste tube...

A super-fast Base64 encoding/decoding implementation

Saw "Base64 encoding and decoding at almost the speed of a memory copy", which can encode and decode Base64 data extremely fast.

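The implementation is accelerated with Intel's AVX-512. When the data is large enough (larger than the L1 cache), it gets close to the speed of a plain memory copy (the memcpy() mentioned here):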

We show how we can encode and decode base64 data at nearly the speed of a memory copy (memcpy) on recent Intel processors, as long as the data does not fit in the first-level (L1) cache. We use the SIMD (Single Instruction Multiple Data) instruction set AVX-512 available on commodity processors. Our implementation generates several times fewer instructions than previous SIMD-accelerated base64 codecs.
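For reference, this is the work being vectorized: every 3 input bytes become four 6-bit indexes into a 64-character alphabet. The following is a plain scalar sketch in Python, not the paper's AVX-512 code; the function name is my own, and it is checked against the standard library's base64 module.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_base64(data: bytes) -> str:
    """Scalar Base64 encoder: 3 bytes in, 4 symbols out, '=' padding on the tail."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to three bytes into a 24-bit integer, zero-padding a short tail.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Split the 24 bits into four 6-bit indexes into the alphabet.
        symbols = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # A tail of 1 or 2 bytes carries only 2 or 3 real symbols; pad the rest.
        keep = len(chunk) + 1
        out.append("".join(symbols[:keep]) + "=" * (4 - keep))
    return "".join(out)


if __name__ == "__main__":
    sample = b"Base64 at memcpy speed"
    assert encode_base64(sample) == base64.b64encode(sample).decode("ascii")
    print(encode_base64(sample))
```

A SIMD version does the same bit shuffling and table lookup on many bytes at once, which is where the wide AVX-512 registers pay off.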

That leaves AMD crying for now, though...

Installing Ubuntu 18.04 on my home computer

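Last Thursday my desktop at home wouldn't boot. After a day of troubleshooting I found that the system SSD had died, so I bought an M.2 SSD and planned to upgrade from Ubuntu 16.04 to Ubuntu 18.04 while I was at it. But Ubuntu 18.04 switched the default desktop from Unity to GNOME (dressed up in a Unity skin), and I had also moved the system from an Intel platform to AMD a while back, so the whole thing got super messy and turned into one landmine after another...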

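The first problem was installing with UEFI + LUKS. I wanted to install onto the M.2 SSD, but Ubuntu 18.04's grub-install insists on writing to /dev/sda and can't be told otherwise: "“Unable to install GRUB in /dev/sda” when installing GRUB". The workaround in that post didn't work for me either, so I gave up, got a SATA SSD, plugged it into SATA Port 1, and used the M.2 as a data drive.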

Hardware-related problems:

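Some people still hit the C6 bug with AMD's 3700X on Linux. If you run into it, you can avoid it with a BIOS setting (workaround): "Can we recognize broken C6 states in all of Zen, Zen+ and Zen2? Striking people at idle (Mostly only Linux and BSD users)"
The ASUS motherboards come with an onboard Intel I211-AT NIC (ASUS ROG STRIX B450-F GAMING at home, ASUS PRIME X470-PRO at work), but it is very laggy under Linux (kernel versions 4.15 and 5.0): network-related commands such as ifconfig or ip link sometimes hang for a few seconds.
Following on from that: I happened to have time over the weekend to cross-test the machine at home. After three or four days of testing, I tried plugging in a separate Intel NIC (PCI-E) and moving the cable over, and that solved it... At the office I'll test with a USB NIC first, and if the lag really goes away I'll find a PCI-E NIC to replace the onboard one.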

Software-related problems:

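Setting up a PPPoE connection from the GUI isn't supported right now (WTF). Of the several alternatives, I recommend pppoeconf; afterwards you can edit /etc/ppp/options to add the IPv6 settings.
I wanted to install gnome-shell-extension-system-monitor to keep an eye on the system, but it made the system extremely laggy. After turning it off the system was merely ordinarily laggy (which is how I later found the Intel I211-AT problem).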

At least it's usable now. Next comes the endless round of filling in all the settings...

Intel CPU security fixes on Linux and their performance impact

Saw this on Hacker News Daily, about the performance cost of the Linux kernel fixes needed for Intel CPUs' various security issues: "HOWTO make Linux run blazing fast (again) on Intel CPUs".

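The bullets from this series have been flying long enough (although smaller ones keep coming), so it's a good time to look back at where things stand.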

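The tests here focus on two things: mitigations=off and whether SMT is enabled (SMT is called Hyper-threading on Intel). From the two sets of results, the performance impact of the current mitigations has gradually come down to an acceptable level (under 5%), but disabling SMT costs roughly 20% to 30%: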

But leaving SMT on is basically a big pit. If you follow what people have been digging into, a whole pile of Intel-specific security issues are tied to SMT...
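To see what a given machine is actually running with, Linux exposes the per-vulnerability status and the SMT state through sysfs. The following is a minimal sketch, assuming a kernel recent enough to provide /sys/devices/system/cpu/vulnerabilities and /sys/devices/system/cpu/smt/control; the function names are my own.

```python
from pathlib import Path


def read_mitigations(sysfs_dir="/sys/devices/system/cpu/vulnerabilities"):
    """Return {vulnerability name: status string}, or {} if the directory is missing."""
    base = Path(sysfs_dir)
    if not base.is_dir():
        return {}
    return {p.name: p.read_text().strip() for p in sorted(base.iterdir()) if p.is_file()}


def read_smt_control(path="/sys/devices/system/cpu/smt/control"):
    """Return the SMT control state (for example 'on' or 'off'), or None if unavailable."""
    p = Path(path)
    return p.read_text().strip() if p.is_file() else None


if __name__ == "__main__":
    for name, status in read_mitigations().items():
        print(f"{name:24s} {status}")
    print("SMT control:", read_smt_control())
```

Booting with mitigations=off should show up here as entries reported as vulnerable rather than mitigated, which makes it easy to confirm which configuration a benchmark really ran under.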

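A quick tangent: I recently got an AMD Ryzen 7 3700X to use (also as a Linux desktop), and only then realized how CPU-hungry today's web pages are. The web versions of Slack and Office 365 are so much faster than on the old machine that I nearly decided to replace my home desktop too...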

Google Chrome's patch for a CPU bug

Now that there's a lead, someone will probably go dig for the underlying problem...

First I saw "Speculative fix to crashes from a CPU bug" on Hacker News. It's a speculative fix, made because they found very strange crashes on Intel's low-power Gemini Lake chips:

For the last few months Chrome has been seeing many "impossible" crashes on Intel Gemini Lake, family 6 model 122 stepping 1 CPUs. These crashes only happen with 64-bit Chrome and only happen in the prologue of two functions. The crashes come and go across different Chrome versions.

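Based on the crash logs they guessed it was alignment-related, so they decided to force the alignment with __attribute__, which both gcc and clang support, to sidestep it. It looks like they don't have an environment that can reproduce the crash, though, so all they can do is push the change out first...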

EC2's C5 family adds 12x, 24x, and metal sizes

EC2's c5 family has added three new sizes, 12xlarge, 24xlarge, and metal: "Now Available: New C5 instance sizes and bare metal instances".

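The c5 lineup used to be large, xlarge, 2xlarge, 4xlarge, 9xlarge, and 18xlarge (things already get a bit irregular here), and now 12xlarge and 24xlarge have been added. Perhaps the instances have grown large enough that the numbers need to line up with the hardware specs?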

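The other point is that c5.24xlarge and c5.metal actually have the same hardware specs (and price). The main difference is that c5.metal has no virtualization, so besides squeezing out more performance, you also get to see a lot of hardware information: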

do not want to take the performance hit of nested virtualization,
need access to physical resources and low-level hardware features, such as performance counters and Intel VT that are not always available or fully supported in virtualized environments,
are intended to run directly on the hardware, or licensed and supported for use in non-virtualized environments.

Ubuntu 19.10 to drop the i386 architecture

Ubuntu 19.10 will no longer support the i386 architecture: "i386 architecture will be dropped starting with eoan (Ubuntu 19.10)".

I looked up the x86-64 article: AMD's first x86-64 processor was released in April 2003:

The first AMD64-based processor, the Opteron, was released in April 2003.

Intel followed in June 2004:

The first processor to implement Intel 64 was the multi-socket processor Xeon code-named Nocona in June 2004.

But the mobile version only arrived in July 2006:

The first Intel mobile processor implementing Intel 64 is the Merom version of the Core 2 processor, which was released on July 27, 2006.

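Either way, it has been more than ten years. And given that Ubuntu 18.04 comes with five years of support, i386 users still have something to run until April 2023...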
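The dates above work out as follows. This is just a worked sketch: the release dates come from the quotes, the day of month is a placeholder where only a month is given, and I'm taking Ubuntu 19.10 as October 2019 and Ubuntu 18.04 as April 2018 from their version numbers.

```python
from datetime import date

# First x86-64 processors, per the quoted article (day = 1 where only a month is given).
milestones = {
    "AMD Opteron": date(2003, 4, 1),
    "Intel Xeon Nocona": date(2004, 6, 1),
    "Intel Core 2 Merom (mobile)": date(2006, 7, 27),
}
ubuntu_19_10 = date(2019, 10, 1)  # assumed from the version number
ubuntu_18_04 = date(2018, 4, 1)   # assumed from the version number
SUPPORT_YEARS = 5                 # five years of support, as stated in the post


def full_years_between(start, end):
    """Whole years elapsed from start to end."""
    return end.year - start.year - ((end.month, end.day) < (start.month, start.day))


if __name__ == "__main__":
    for name, released in milestones.items():
        print(f"{name}: {full_years_between(released, ubuntu_19_10)} years before 19.10")
    end = ubuntu_18_04.replace(year=ubuntu_18_04.year + SUPPORT_YEARS)
    print("18.04 support ends around:", end.strftime("%B %Y"))
```

Even the latest of the three milestones, the mobile one, is more than a decade older than the 19.10 release.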

After a string of security updates, AMD's CPUs are now faster than Intel's...

Saw these tests in "Intel Performance Hit 5x Harder Than AMD After Spectre, Meltdown Patches":

With all the current security fixes turned on, Intel CPUs lose around 20% of their performance (on Intel this includes turning HT off):

While the impacts vary tremendously from virtually nothing too significant on an application-by-application level, the collective whack is ~15-16 percent on all Intel CPUs without Hyper-Threading disabled. Disabling increases the overall performance impact to 20 percent (for the 7980XE), 24.8 percent (8700K) and 20.5 percent (6800K).

The AMD CPUs are not tested with HT disabled, because disabling SMT isn’t a required fix for the situation on AMD chips, but the cumulative impact of the decline is much smaller. AMD loses ~3 percent with all fixes enabled

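Note that between the two vendors' current top-end desktop parts, once the security updates are applied, AMD's 2990WX comes out slightly faster than Intel's 7980XE... All the toothpaste that got squeezed out back then has been spat right back, and who knows how much more future security issues will take back.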

A hands-on comparison of Linode's Dedicated instances and AWS's c5.*

I mentioned earlier that Linode launched Dedicated instances: "Linode launches Dedicated CPU Instances". Now I've found a chance to test them, comparing Linode's Dedicated (4GB) against AWS's c5.large. Both have 2 vCPUs and 4GB of RAM.

The tests use n-st/nench and OpenSSL's speed (covering aes, md5, rsa, sha1, and sha256). I've posted all the results here: "Linode (Dedicated 4GB) v.s. AWS (c5.large)".

On the CPU side, the main difference is that Linode uses AMD while AWS uses Intel, so quite a few of the numbers come out differently...

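Looking closely at the OpenSSL numbers, the differences between algorithms are fairly large. The first explanations that come to mind are differences in hardware acceleration and in cache architecture: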

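For ciphers I only tested AES (the current mainstream). With small blocks (16/64/256 bytes) AMD loses by a bit, but with large blocks (1024/8192/16384 bytes) it wins by quite a lot.
For hashes, Linode loses a bit on MD5 but wins a bit on SHA1, and on SHA256 it performs outrageously well, winning by a factor of two XDDD
For public key I tested RSA, where Linode loses pretty badly...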
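To get a feel for how throughput depends on block size, here is a rough sketch in the spirit of openssl speed, using Python's hashlib rather than the openssl tool used for the numbers above. The block sizes mirror the ones listed; the function name and the short default duration are my own choices, and for small blocks Python's call overhead dominates, so only the trend is meaningful.

```python
import hashlib
import time

BLOCK_SIZES = (16, 64, 256, 1024, 8192, 16384)


def hash_throughput(name, block_size, duration=0.2):
    """Hash one fixed block repeatedly for about `duration` seconds; return bytes/second."""
    block = b"\x00" * block_size
    count = 0
    start = time.perf_counter()
    deadline = start + duration
    while time.perf_counter() < deadline:
        hashlib.new(name, block).digest()
        count += 1
    elapsed = time.perf_counter() - start
    return count * block_size / elapsed


if __name__ == "__main__":
    # md5 may be unavailable on FIPS-restricted builds; drop it there.
    for algo in ("md5", "sha1", "sha256"):
        rates = [hash_throughput(algo, size) / 1e6 for size in BLOCK_SIZES]
        print(algo.ljust(8), "  ".join(f"{r:8.1f}" for r in rates), "MB/s")
```

Running the same script on both instance types gives a quick sanity check on whether the hash results point the same way as the OpenSSL ones.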

Considering the price is only about half of AWS's, that's not bad at all...