The single-CPU server is wired up with four 100Gbps interfaces, and the dual-CPU server with eight (SUT here stands for system under test):
These client systems were connected to the CDN servers using 100 GbE links through a switch; 4x100 GbE connections for the single-processor SUT, and 8x100 GbE for the dual-processor SUT. Testing was done using Wrk, a widely recognized open-source HTTP(S) benchmarking tool.
The test configuration is also described at the end of the white paper. Both setups ran on Ubuntu 20.04; the single-CPU machine used two Intel 100Gbps NICs, while the dual-CPU machine used four Mellanox 100Gbps NICs:
3rd generation Intel Xeon Scalable testing done by Intel in September 2021. Single processor SUT configuration was based on the Supermicro SMC 110P-WTR-TNR single socket server based on Intel® Xeon® Platinum 8380 processor (microcode: 0xd000280) with 40 cores operating at 2.3 GHz. The server featured 256 GB of RAM. Intel® Hyper-Threading Technology was enabled, as was Intel® Turbo Boost Technology 2.0. Platform controller hub was the Intel C620. NUMA balancing was enabled. BIOS version was 1.1. Network connectivity was provided by two 100 GbE Intel® Ethernet Network Adapters E810. 1.2 TB of boot storage was available via an Intel SSD. Application storage totaled 3.84 TB per drive and was provided by 8 Intel P5510 SSDs. The operating system was Ubuntu Linux release 20.04 LTS with kernel 5.4.0-80-generic. Compiler GCC was version 9.3.0. The workload was wrk/master (April 17, 2019), and the version of Varnish was varnish-plus-6.0.8r3. Openssl v1.1.1h was also used. All traffic from clients to SUT was encrypted via TLS.
3rd generation Intel Xeon Scalable testing done by Intel in September 2021. Dual processor SUT configuration was based on the Supermicro SMC 220U-TNR dual socket server based on Intel® Xeon® Platinum 8380 processor (microcode: 0xd000280) with 40 cores operating at 2.3 GHz. The server featured 256 GB of RAM. Intel® Hyper-Threading Technology was enabled, as was Intel® Turbo Boost Technology 2.0. Platform controller hub was the Intel C620. NUMA balancing was enabled. BIOS version was 1.1. Network connectivity was provided by four 100 GbE Mellanox MCX516A-CDAT adapters. 1.2 TB of boot storage was available via an Intel SSD. Application storage totaled 3.84 TB per drive and was provided by 12 Intel P5510 SSDs. The operating system was Ubuntu Linux release 20.04 LTS with kernel 5.4.0-80-generic. Compiler GCC was version 9.3.0. The workload was wrk/master (April 17, 2019), and the version of Varnish was varnish-plus-6.0.8r3. Openssl v1.1.1h was also used. All traffic from clients to SUT was encrypted via TLS.
Per-object metadata is stored within a discrete S3 subsystem. This system is on the data path for GET, PUT, and DELETE requests, and is responsible for handling LIST and HEAD requests. At the core of this system is a persistence tier that stores metadata. Our persistence tier uses a caching technology that is designed to be highly resilient. S3 requests should still succeed even if infrastructure supporting the cache becomes impaired. This meant that, on rare occasions, writes might flow through one part of cache infrastructure while reads end up querying another. This was the primary source of S3’s eventual consistency.
The most direct way to get rid of eventual consistency would be to rip out the cache, but the performance hit would be too large, so the design had to keep the cache. The idea is then to use another channel to make sure the data sitting in the cache is in the correct state:
One early consideration for delivering strong consistency was to bypass our caching infrastructure and send requests directly to the persistence layer. But this wouldn’t meet our bar for no tradeoffs on performance. We needed to keep the cache. To keep values properly synchronized across cores, CPUs implement cache coherence protocols. And that’s what we needed here: a cache coherence protocol for our metadata caches that allowed strong consistency for all requests.
We had introduced new replication logic into our persistence tier that acts as a building block for our at-least-once event notification delivery system and our Replication Time Control feature. This new replication logic allows us to reason about the “order of operations” per-object in S3. This is the core piece of our cache coherency protocol.
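AWS hasn't published the actual implementation, so the sketch below is only a toy, single-process illustration of the idea the quotes describe: the persistence tier assigns each object a monotonically increasing sequence number, cached metadata is stamped with the sequence it was written at, and a cached entry is served only if that sequence is still the latest one. All names and structures here are invented for illustration.

```c
/* Toy sketch of sequence-number-based cache coherence for per-object
 * metadata. Invented for illustration only; this is not S3's code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    char     value[64];   /* cached metadata blob */
    uint64_t seq;         /* sequence number the entry was written at */
    bool     valid;
} cache_entry;

/* Authoritative per-object state kept by the persistence tier. */
static char     authoritative_value[64];
static uint64_t authoritative_seq = 0;   /* "order of operations" counter */

static cache_entry cache = { .valid = false };

/* A write is ordered by the persistence tier, and the cache is stamped
 * with the sequence number that write was assigned. */
static void put(const char *v)
{
    authoritative_seq++;
    snprintf(authoritative_value, sizeof authoritative_value, "%s", v);
    snprintf(cache.value, sizeof cache.value, "%s", v);
    cache.seq = authoritative_seq;
    cache.valid = true;
}

/* A strongly consistent read serves the cache only if its sequence still
 * matches the latest one; otherwise it refreshes from the persistence tier. */
static const char *get(void)
{
    if (cache.valid && cache.seq == authoritative_seq)
        return cache.value;                     /* coherent cache hit */
    snprintf(cache.value, sizeof cache.value, "%s", authoritative_value);
    cache.seq = authoritative_seq;
    cache.valid = true;
    return cache.value;
}

int main(void)
{
    put("etag-v1");
    put("etag-v2");                    /* the later write wins the order */
    printf("read sees: %s\n", get());  /* always the latest value */
    return 0;
}
```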
These verification techniques were a lot of work. They were more work, in fact, than the actual implementation itself. But we put this rigor into the design and implementation of S3’s strong consistency because that is what our customers need.
The attack works by sending a DNS query to a DNS resolver and then immediately firing a forged DNS response packet at that same resolver; with a bit of luck the forged answer gets cached.
As I remember, 168.95.1.1 and 168.95.192.1 weren't hit as directly back then (compared with other DNS resolvers), because there was a server cluster behind those two resolvers spreading the traffic, so an attacker couldn't reliably guess which IP address the querying DNS server would be using. That is somewhat similar to the mitigation mentioned below (just not as effective).
As I recall, the mitigation that came later was to randomize the source port (so the DNS resolver no longer sends every query out from port 53), which drastically lowered the odds of a successful attack (roughly another factor of 2^16, since the forged response now has to hit the right port as well as the right query ID).
Later, DNS resolvers added nonce mechanisms to push the success probability down even further, and in the end it was the gradually spreading support for DNSSEC that finally put the problem to rest.
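A rough back-of-the-envelope (my own numbers, not from any of the sources above), assuming the attacker has to blindly guess every randomized field in each forged packet:

$$
p_{\text{query ID only}} \approx \frac{1}{2^{16}}, \qquad
p_{\text{query ID + random source port}} \approx \frac{1}{2^{16}} \cdot \frac{1}{2^{16}} = \frac{1}{2^{32}} \approx 2.3 \times 10^{-10}
$$

If the nonce mechanism here refers to 0x20-style case randomization of the query name, it adds roughly one more bit of entropy per letter in the name on top of that.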
The demonstration website can leak data at a speed of 1kB/s when running on Chrome 88 on an Intel Skylake CPU. Note that the code will likely require minor modifications to apply to other CPUs or browser versions; however, in our tests the attack was successful on several other processors, including the Apple M1 ARM CPU, without any major changes.
While experimenting, we also developed other PoCs with different properties. Some examples include:
A PoC which can leak 8kB/s of data at a cost of reduced stability using performance.now() as a timer with 5μs precision.
A PoC which leaks data at 60B/s using timers with a precision of 1ms or worse.
The more painful news is that Google has confirmed this can't be cleanly fixed at the software level; for now, browsers can only lean on various kinds of isolation to reduce the risk, such as running different sites in separate processes:
In 2019, the team responsible for V8, Chrome’s JavaScript engine, published a blog post and whitepaper concluding that such attacks can’t be reliably mitigated at the software level. Instead, robust solutions to these issues require security boundaries in applications such as web browsers to be aligned with low-level primitives, for example process-based isolation.
It's a bit surprising that the Apple M1 is affected as well; it looks like this wasn't evaluated during development? No idea how the rumored M1X and M2 will turn out...
Throwing the entries into an array is fine in itself; the problem is that the code needs to check whether each entry is a duplicate, and it does so without any hash or tree structure. With roughly 63k records, the array-based implementation turns this into an O(n^2) algorithm:
But before it’s stored? It checks the entire array, one by one, comparing the hash of the item to see if it’s in the list or not. With ~63k entries that’s (n^2+n)/2 = (63000^2+63000)/2 = 1984531500 checks if my math is right. Most of them useless. You have unique hashes why not use a hash map.
if it’s called again within the string’s range, return cached value
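The original patch isn't shown in full, but the idea in that line is easy to sketch: remember the last string that was measured, and if strlen is asked about a pointer that still falls inside that string, answer from the cache instead of rescanning. The code below is my own illustrative version, not the actual hook.

```c
/* Illustrative cached strlen, following the idea quoted above. */
#include <string.h>

static const char *cached_start = NULL;
static size_t cached_len = 0;

size_t cached_strlen(const char *s)
{
    /* Pointer into the previously measured string? Reuse the result. */
    if (cached_start != NULL &&
        s >= cached_start && s < cached_start + cached_len)
        return cached_len - (size_t)(s - cached_start);

    cached_len = strlen(s);   /* fall back to one real scan */
    cached_start = s;
    return cached_len;
}

int main(void)
{
    const char *s = "hello world";
    size_t a = cached_strlen(s);      /* real scan, fills the cache */
    size_t b = cached_strlen(s + 6);  /* same string, served from the cache */
    return (a == 11 && b == 5) ? 0 : 1;
}
```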
As for the second problem, he simply skips the duplicate check entirely, since the data itself contains no duplicates:
And as for the hash-array problem, it’s more straightforward - just skip the duplicate checks entirely and insert the items directly since we know the values are unique.
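Put together, the two versions of the insert path look roughly like this (hypothetical names, not the game's actual code): the original scans everything stored so far on every insert, which is where the roughly (n²+n)/2 comparisons come from, while the fixed version just appends because the inputs are known to be unique.

```c
/* Sketch of the two insert strategies discussed above. */
#include <stdint.h>
#include <stdio.h>

#define MAX_ITEMS 63000

static uint64_t hashes[MAX_ITEMS];
static size_t count = 0;

/* Original pattern: O(n) scan per insert, O(n^2) over the whole load. */
static int insert_checked(uint64_t h)
{
    for (size_t i = 0; i < count; i++)
        if (hashes[i] == h)
            return 0;            /* duplicate, skip */
    hashes[count++] = h;
    return 1;
}

/* The fix: the entries are known to be unique, so just append in O(1). */
static void insert_unchecked(uint64_t h)
{
    hashes[count++] = h;
}

int main(void)
{
    insert_checked(42);
    insert_checked(42);          /* rejected as a duplicate */
    insert_unchecked(43);

    /* Back-of-the-envelope from the quote: (n^2 + n) / 2 comparisons. */
    unsigned long long n = 63000ULL;
    printf("entries stored: %zu, worst-case comparisons for n=63000: %llu\n",
           count, (n * n + n) / 2);
    return 0;
}
```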
I found this while making a collection of what C implementation does what at https://news.ycombinator.com/item?id=26298300.
There are two basic implementation strategies. The BSD (FreeBSD and OpenBSD and more than likely NetBSD too), Microsoft, GNU, and MUSL C libraries use one, and suffer from this; whereas the OpenWatcom, P.J. Plauger, Tru64 Unix, and my standard C libraries use another, and do not.
The 2002 report in the comp.lang.c Usenet newsgroup (listed in that discussion) is the earliest that I've found so far.
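A small harness that makes the difference between the two strategies visible: parsing one big buffer with many small sscanf calls degrades to quadratic time on C libraries whose sscanf measures the whole remaining string on every call, while strtod-style incremental parsing only touches the characters it actually consumes. This is my own demonstration, not code from any of the libraries named above.

```c
/* Parse a large buffer of numbers two ways and compare the cost. */
#include <stdio.h>
#include <stdlib.h>

/* Repeated sscanf + %n: O(length of remaining buffer) per call on the
 * affected C libraries, so O(n^2) overall. */
static size_t parse_sscanf(const char *buf)
{
    size_t count = 0;
    double v;
    int consumed;
    while (sscanf(buf, "%lf%n", &v, &consumed) == 1) {
        buf += consumed;         /* advance past the number just read */
        count++;
    }
    return count;
}

/* strtod never needs the total string length, so this stays O(n). */
static size_t parse_strtod(const char *buf)
{
    size_t count = 0;
    char *end;
    for (;;) {
        double v = strtod(buf, &end);
        if (end == buf)          /* no further number found */
            break;
        (void)v;
        buf = end;
        count++;
    }
    return count;
}

int main(void)
{
    /* Build a buffer with many small numbers; raise N to make the
     * quadratic cost of parse_sscanf obvious on affected libraries. */
    enum { N = 20000 };
    char *buf = malloc(N * 8 + 1);
    if (buf == NULL)
        return 1;
    size_t off = 0;
    for (int i = 0; i < N; i++)
        off += (size_t)sprintf(buf + off, "%d ", i % 1000);
    printf("sscanf: %zu numbers\n", parse_sscanf(buf));
    printf("strtod: %zu numbers\n", parse_strtod(buf));
    free(buf);
    return 0;
}
```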