Google 釋出網頁版的 Spectre 攻擊 PoC,包括 Apple M1 在內

在大約三年前 (2018 年年初) 的時候,在讀完 Spectre 之後寫下了一些記錄:「讀書時間:Spectre 的攻擊方式」,結果在 Bruce Schneier 這邊看到消息,Google 前幾天把把 PoC 放出來了:「Exploiting Spectre Over the Internet」,在 Hacker News 上也有討論:「A Spectre proof-of-concept for a Spectre-proof web (」。

首先是這個攻擊方法在目前的瀏覽器都還有用,而且包括 Apple M1 上都可以跑:

The demonstration website can leak data at a speed of 1kB/s when running on Chrome 88 on an Intel Skylake CPU. Note that the code will likely require minor modifications to apply to other CPUs or browser versions; however, in our tests the attack was successful on several other processors, including the Apple M1 ARM CPU, without any major changes.

即使目前的瀏覽器都已經把 改為 1ms 的精度,也還是可以達到 60 bytes/sec 的速度:

While experimenting, we also developed other PoCs with different properties. Some examples include:

  • A PoC which can leak 8kB/s of data at a cost of reduced stability using as a timer with 5μs precision.
  • A PoC which leaks data at 60B/s using timers with a precision of 1ms or worse.

比較苦的消息是 Google 已經確認在軟體層沒辦法解乾淨,目前在瀏覽器上只能靠各種 isolation 降低風險,像是將不同站台跑在不同的 process 裡面:

In 2019, the team responsible for V8, Chrome’s JavaScript engine, published a blog post and whitepaper concluding that such attacks can’t be reliably mitigated at the software level. Instead, robust solutions to these issues require security boundaries in applications such as web browsers to be aligned with low-level primitives, for example process-based isolation.

Apple M1 也中這件事情讓人比較意外一點,看起來是當初開發的時候沒評估?目前傳言的 M1x 與 M2 不知道會怎樣...

Cloudflare 再次嘗試 ARM 伺服器

2018 年的時候寫過一篇 Cloudflare 在嘗試 ARM 伺服器的進展:「Cloudflare 用 ARM 當伺服器的進展...」,後來就沒有太多公開的消息,直到這幾天看到「ARMs Race: Ampere Altra takes on the AWS Graviton2」才看到原因:

By the time we completed porting our software stack to be compatible with ARM, Qualcomm decided to exit the server business.

所以是都測差不多,也都把 Cloudflare 自家的軟體搬上去了,但 Qualcomm 也決定收手,沒機器可以用...

這次再次踏入 ARM 領域讓人想到前陣子 AppleM1,讓大家看到 ARM 踏入桌機與筆電領域可以是什麼樣貌...

這次 Cloudflare 選擇了 Ampere Altra,這是基於 Neoverse N1 的平台,而這個平台的另外一個知名公司就是 AWSGraviton2,所以就拿來比較:

可以看到 Ampere Altra 的核心數多了 25% (64 vs. 80),運作頻率多了 20% (2.5Ghz vs. 3.0Ghz)。測試的結果也都有高有低,落在 10%~40% 都有。

不過其中比較特別的是 Brotli - 9 的測試特別差 (而且是 8 與 10 都正常的情況下):

依照 Cloudflare 的說法,他們其實不會用到 Brotli - 7 以及更高的等級,不過畢竟有測出來,還是花了時間找一下根本原因:

Although we do not use Brotli level 7 and above when performing dynamic compression, we decided to investigate further.

反追問題後發現跟 Page Faults 以及 Pipeline Backend Stalls 有關,不過是可以改寫避開,在避開後可以達到跟 Graviton2 類似的水準:

By analyzing our dataset further, we found the common underlying cause appeared to be the high number of page faults incurred at level 9. Ampere has demonstrated that by increasing the page size from 4K to 64K bytes, we can alleviate the bottleneck and bring the Ampere Altra at parity with the AWS Graviton2. We plan to experiment with large page sizes in the future as we continue to evaluate Altra.

但目前看起來應該都還算正向,看起來供貨如果穩定的話,應該有機會換過去?畢竟 ARM 平台可以省下來的電力太多了,現在因為 M1 對 ARM 的公關效果太驚人的關係,解釋起來會更輕鬆...

Apple M1 的效能與省電原因

Hacker News Daily 上看到 Apple M1 為什麼這麼快又省電的解釋,可以當作一種看法:

可以在 Thread reader 上面讀:「Thread by @ErrataRob on Thread Reader App – Thread Reader App」。

看起來 Apple 在規劃的時候就有考慮 x86 模擬問題,所以在記憶體架構上直接實做了對應的模式,大幅降低了當年 MicrosoftSurface 上遇到的問題:

3/ The biggest hurdle was "memory-ordering", the order in which two CPUs see modifications in memory by each other. It's the biggest problem affecting Microsoft's emulation of x86 on their Arm-based "Surface" laptops.

4/ So Apple simply cheated. They added Intel's memory-ordering to their CPU. When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering.

另外一個比較有趣的架構是,Apple M1 上面的兩個 core 有不同的架構,一顆對效能最佳化,另外一顆對效率最佳化:

13/ Apple's strategy is to use two processors: one designed to run fast above 3 GHz, and the other to run slow below 2 GHz. Apple calls this their "performance" and "efficiency" processors. Each optimized to be their best at their goal.

在 wikipedia 上的介紹也有提到這兩個 core 的不同,像是 L1 cache 的差異 (128KB 與 192KB),以及功耗的差異:

The M1 has four high-performance "Firestorm" and four energy-efficient "Icestorm" cores, providing a configuration similar to ARM big.LITTLE and Intel's Lakefield processors. This combination allows power-use optimizations not possible with Apple–Intel architecture devices. Apple claims the energy-efficient cores use one tenth the power of the high-performance ones. The high-performance cores have 192 KB of instruction cache and 128 KB of data cache and share a 12 MB L2 cache; the energy-efficient cores have a 128 KB instruction cache, 64 KB data cache, and a shared 4 MB L2 cache. The Icestorm "E cluster" has a frequency of 0.6–2.064 GHz and a maximum power consumption of 1.3 W. The Firestorm "P cluster" has a frequency of 0.6–3.204 GHz and a maximum power consumption of 13.8 W.

再加上其他架構上的改善 (像是針對 JavaScript 的指令集、L1 的提昇,以及用 TSMC 最新製程),累積起來就變成把 Intel 版本壓在地上磨蹭的結果了...

Amazon EC2 增加 T2 instance

Amazon EC2 增加了新的 T2 instance:「New Low Cost EC2 Instances with Burstable Performance」。

T2 系列出了三個等級:t2.micro (1GB)、t2.small (2GB)、t2.medium (4GB)。以 us-east-1 的 t2.micro 價錢來看,只貴 t1.micro 一點點 (USD$0.012/hour 與 USD$0.013/hour),但記憶體大了不少 (640MB 與 1GB)。

另外推出了 CPU Credits 這種計算方式,可以累計 24 小時的 CPU Credits。我在想,AWS 能夠推出這個機制,是已經做到像是 VMware 的 vMotion 之類的不停機遷移嗎?對於在 10Gbps 的 1GB RAM 上的確是不用一秒鐘就可以傳完 RAM 的內容...

CPU Credits 這個機制跟 auto scaling 解決問題的方向有點不太一樣,但也是還不錯的方法... 拿來打組合拳應該還不錯 :p

另外一個比較特別的是在文末有提到對 m1.small 與 m1.medium 的想法。t2.{small,medium} 被認為是 m1.{small,medium} 的接班人 (之一?):

  • t1.micro to t2.micro
  • m1.small to t2.small
  • m1.medium to t2.medium

其中 m3.medium 之前是被認為是 m1.medium 的接班人,看起來雖然都是 General Purpose,但打算多分幾種不同的應用來滿足需求。

可以在 AWS VPC 內開 Micro Instance 了...

之前在 AWS VPC (Virtual Private Cloud) 能開的最小台機器是 m1.small,而 AWS 總算是宣佈可以在 VPC 裡開更小台的 t1.micro 了:「Amazon VPC now Supports Micro Instances」、「Launch EC2 Micro Instances in a Virtual Private Cloud」。

剛好可以拿來當 VPC 的 NAT server 來用...