arm – Page 2 – Gea-Suan Lin's BLOG

OpenVPN Server Performance on AWS

For a follow-up to this post, see "Amazon EC2 Network Performance".

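I've recently been tuning the performance of an OpenVPN server running on Amazon EC2, trying to raise its network throughput as a workaround until we roll out WireGuard. The tuning turned out to be quite useful, so here are notes on what can be adjusted...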

Before pushing heavy traffic through it, we were using a t3a.nano (with Unlimited mode on). The bottleneck we observed was the OpenVPN daemon using 100% CPU, with throughput capped at around 42MB/sec.

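My first idea was to see whether OpenVPN server has a way to use multiple CPUs, but from what I found, OpenVPN server cannot use threading, fork, or similar methods to take advantage of multiple CPUs, so I started looking at other approaches...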

Next I noticed that we are using AES-256-CBC. Many articles online mention that AES-128-CBC is somewhat faster, but our OpenVPN clients are already hardcoded to AES-256-CBC, so that's not an option...

The first workable fix was switching from the AMD-based t3a.nano to the ARM-based t4g.nano. The daemon still sits at 100% CPU, but throughput went up by more than 50%, to 69MB/sec.

The second fix was the fast-io option, which I found while searching. Adding it gives a bit more speed, up to 77MB/sec.
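To put the two workarounds in perspective, here is a quick back-of-the-envelope calculation of the relative gains. It uses only the three throughput numbers measured above; the variable and function names are made up for illustration.

```python
# Throughput figures measured in this post, in MB/sec.
baseline_t3a = 42.0   # t3a.nano, OpenVPN daemon at 100% CPU
t4g = 69.0            # after moving to t4g.nano
t4g_fast_io = 77.0    # t4g.nano with fast-io added


def gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


arm_gain = gain(baseline_t3a, t4g)
fast_io_gain = gain(t4g, t4g_fast_io)
total_gain = gain(baseline_t3a, t4g_fast_io)

print(f"t3a.nano -> t4g.nano: +{arm_gain:.0f}%")
print(f"adding fast-io:       +{fast_io_gain:.0f}%")
print(f"combined:             +{total_gain:.0f}%")
```

So the instance change accounts for most of the improvement (roughly 64%), and fast-io adds about 12% on top of that.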

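With these two workarounds it should be usable. Next I noticed that after transferring a large amount of data for a while, the speed drops. So I launched two t4g.nano instances and tested them against each other with iperf, and found that the throughput steps down in stages: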

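For the first 15 seconds it reaches 5Gbps, the maximum speed advertised on the AWS website, then drops to around 800Mbps.
At around 180 seconds it drops to 300Mbps.
At around 210 seconds it goes back up to 800Mbps.
At around 300 seconds it drops to 500Mbps.
At around 300 seconds it drops to 300Mbps.
At around 1260 seconds it drops to 30Mbps and stays at that speed from then on.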

It looks like the network bandwidth credit is tiered, but 30Mbps is really quite low...
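Since the iperf numbers are in Mbps while the OpenVPN numbers are in MB/sec, a small conversion helps when comparing them. This is only a sketch: it assumes both tools use decimal prefixes and 8 bits per byte, and the names are illustrative.

```python
def mbps_to_mbytes_per_sec(mbps: float) -> float:
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbps / 8.0


# Throughput stages seen in the iperf run, in Mbps.
observed_mbps = [5000, 800, 500, 300, 30]

for rate in observed_mbps:
    print(f"{rate:>5} Mbps = {mbps_to_mbytes_per_sec(rate):7.2f} MB/sec")

# The best OpenVPN figure from earlier in the post, converted the other way.
openvpn_mbytes_per_sec = 77
openvpn_mbps = openvpn_mbytes_per_sec * 8
print(f"OpenVPN at {openvpn_mbytes_per_sec} MB/sec needs about {openvpn_mbps} Mbps")
```

In these units the 30Mbps floor is under 4MB/sec, while sustaining 77MB/sec through OpenVPN would need roughly 616Mbps, which only the first two stages provide.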

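After switching to the four-times-larger t4g.small, it still only reached about 40MB/sec (puzzlingly, not four times as much?). We are now on a c6g.medium, but the network still seems to be a bottleneck, at around 46MB/sec. I need to think about what to adjust next...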

But to summarize what I've seen so far: if you can use the ARM architecture, do so. Its efficiency and price are considerably better than x86-64...

Cloudflare Starts Using ARM Servers in Production

In "Designing Edge Servers with Arm CPUs to Deliver 57% More Performance Per Watt", Cloudflare mentions that they are now running the ARM architecture in production:

Our first Arm CPU was deployed in production earlier this month — July 2021.

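I recall they had been testing for many years. At one point things looked promising midway through testing, but the original vendor decided not to continue. Only later did other vendors step in, and now there are finally reasonably mature products to use.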

Over at AWS, the ARM servers are also a great deal. If you haven't tried them yet, give them a go; at the very least, this blog and my wiki both run on them.

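The article also covers current x86 performance: the newer generation of AMD CPUs delivers only about 39% more performance per watt than the previous generation, while the ARM CPUs reach 57% on the same comparison: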

Our most recently deployed generation of edge servers, Gen X, used AMD Rome CPUs. Compared with that, the newest Arm based CPUs process an incredible 57% more Internet requests per watt. While AMD has a sequel, Milan (and which Cloudflare will also be deploying), it doesn’t achieve the same degree of energy efficiency that the Arm processor does — managing only 39% more requests per watt than Rome CPUs in our existing fleet.
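Both percentages in the quote are relative to Rome, so the gap between Arm and Milan is not simply 57 minus 39. A minimal sketch of the arithmetic, normalizing Rome to 1.0 (the variable names are illustrative):

```python
# Requests per watt, normalized so that AMD Rome = 1.0.
rome = 1.00
milan = rome * 1.39   # "39% more requests per watt than Rome"
arm = rome * 1.57     # "57% more Internet requests per watt"

# How much better the Arm CPUs are when compared directly with Milan.
arm_vs_milan = (arm / milan - 1.0) * 100.0

print(f"Arm vs. Milan: +{arm_vs_milan:.1f}% requests per watt")
```

By these numbers the Arm servers come out roughly 13% ahead of Milan in requests per watt.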

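Now that they've started rolling these out to production, the replacement pace should keep picking up, and it also means Cloudflare will start optimizing for the ARM platform.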

AWS Launches X2gd Instances, an Even Cheaper Option for Memory

AWS has launched the X2gd instance family. It uses ARM CPUs, with fewer vCPUs and smaller disks, in exchange for a lower price: "New Amazon EC2 X2gd Instances – Graviton2 Power for Memory-Intensive Workloads".

Let's look at us-east-1 prices, starting with two ARM instances. The first is the new x2gd.medium, with only 1 vCPU + 59 GB SSD but 16 GB RAM, currently priced at $0.0835/hr.

The other one, r6g.large, has 2 vCPU + 16 GB RAM and is EBS only, at $0.1008/hr.

Next is Intel's x1e.xlarge, with 4 vCPU + 122 GB RAM + 120 GB SSD, at $0.834/hr. The unit price per GB of memory is in the same range, though you get a bit less memory for the money.

Intel also has the r5.large, with 2 vCPU + 16 GB RAM + EBS only, at $0.126/hr.

The last one is AMD's r5a.large, which is very similar to Intel's r5.large, with 2 vCPU + 16 GB RAM + EBS only, at $0.113/hr.
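To compare the four 16 GB instances listed above on equal footing, here is a small sketch that computes the hourly price per GB of RAM and how much the x2gd.medium saves against each. The specs and prices are the ones quoted in this post; the `Instance` class is just an illustrative container.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    vcpu: int
    ram_gb: int
    usd_per_hour: float


# us-east-1 prices as quoted in this post (16 GB RAM instances only).
instances = [
    Instance("x2gd.medium", 1, 16, 0.0835),
    Instance("r6g.large", 2, 16, 0.1008),
    Instance("r5.large", 2, 16, 0.126),
    Instance("r5a.large", 2, 16, 0.113),
]

# Cheapest instance by price per GB of RAM.
cheapest = min(instances, key=lambda i: i.usd_per_hour / i.ram_gb)

# Percentage saved by picking the cheapest instance over each alternative.
savings = {
    inst.name: (1 - cheapest.usd_per_hour / inst.usd_per_hour) * 100
    for inst in instances
}

for inst in instances:
    per_gb = inst.usd_per_hour / inst.ram_gb
    print(f"{inst.name:<12} ${per_gb:.5f}/GB-hr  "
          f"{cheapest.name} saves {savings[inst.name]:4.1f}%")
```

The trade-off is that the x2gd.medium gives up one vCPU for that saving, so this only helps when the workload is bound by memory rather than CPU.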

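The new X2gd family offers an extreme option for when all you need is memory. And from past experience, Graviton2 is really fast, so 1 vCPU may well be enough... At least my blog's PHP and MariaDB both run on t4g, and it seems quite a bit faster than when it was hosted on a VPS :o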

Cloudflare Tries ARM Servers Again

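In 2018 I wrote about Cloudflare's progress in trying ARM servers: "Cloudflare's Progress on Using ARM for Servers...". There wasn't much public news after that, until a few days ago when I read "ARMs Race: Ampere Altra takes on the AWS Graviton2" and found out why: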

By the time we completed porting our software stack to be compatible with ARM, Qualcomm decided to exit the server business.

So the testing was mostly done and Cloudflare's own software had been ported over, but then Qualcomm decided to pull out, leaving no machines to use...

This renewed move into ARM brings to mind Apple's recent M1, which showed everyone what ARM can look like on desktops and laptops...

This time Cloudflare chose Ampere Altra, a platform based on Neoverse N1. The other well-known product built on this platform is AWS's Graviton2, so they compared the two:

Ampere Altra has 25% more cores (80 vs. 64) and a 20% higher clock speed (3.0GHz vs. 2.5GHz). The test results vary from benchmark to benchmark, with gains anywhere from 10% to 40%.
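As a rough sanity check on those results, the sketch below computes a naive upper bound that assumes throughput scales linearly with cores times clock speed. Real workloads rarely scale that way, so treat this as an illustration only; the variable names are made up.

```python
# Specs as quoted above.
graviton2_cores, altra_cores = 64, 80
graviton2_ghz, altra_ghz = 2.5, 3.0

core_ratio = altra_cores / graviton2_cores   # 25% more cores
clock_ratio = altra_ghz / graviton2_ghz      # 20% higher clock

# Naive assumption: aggregate throughput scales with cores * clock.
naive_ratio = core_ratio * clock_ratio

print(f"cores: x{core_ratio:.2f}, clock: x{clock_ratio:.2f}, "
      f"naive aggregate: x{naive_ratio:.2f}")
```

Under that assumption the ceiling would be about 1.5x, so measured gains of 10% to 40% fall below the naive bound.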

The odd one out is the Brotli - 9 test, which is notably worse (even though levels 8 and 10 look normal):

According to Cloudflare, they don't actually use Brotli level 7 or higher, but since the test surfaced the issue, they still spent time tracking down the root cause:

Although we do not use Brotli level 7 and above when performing dynamic compression, we decided to investigate further.

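Tracing it back, they found it was related to page faults and pipeline backend stalls. It can be worked around, though, and with the workaround the Altra reaches a level similar to Graviton2: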

By analyzing our dataset further, we found the common underlying cause appeared to be the high number of page faults incurred at level 9. Ampere has demonstrated that by increasing the page size from 4K to 64K bytes, we can alleviate the bottleneck and bring the Ampere Altra at parity with the AWS Graviton2. We plan to experiment with large page sizes in the future as we continue to evaluate Altra.
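To see why a larger page size cuts down on page faults, the sketch below prints the page size of the current machine and counts how many pages a buffer spans at 4K versus 64K. The 16 MiB buffer is a hypothetical working set chosen for illustration, not a figure from Cloudflare's post, and `os.sysconf` is assumed to be available (Unix-like systems).

```python
import os

# Page size of the machine this runs on (commonly 4096 on Linux).
page_size = os.sysconf("SC_PAGE_SIZE")
print(f"current page size: {page_size} bytes")


def pages_needed(buffer_bytes: int, page_bytes: int) -> int:
    """Number of pages a buffer spans, rounded up."""
    return -(-buffer_bytes // page_bytes)  # ceiling division


# Hypothetical working set, only for illustration.
buffer_bytes = 16 * 1024 * 1024

for size in (4 * 1024, 64 * 1024):
    print(f"{size // 1024:>2}K pages: {pages_needed(buffer_bytes, size)} "
          f"pages to touch a 16 MiB buffer")
```

Each page touched for the first time can cost one page fault, so going from 4K to 64K pages cuts that count by a factor of 16 for the same buffer.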

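Overall things look positive so far. If supply is stable, there seems to be a chance they will switch over? After all, the ARM platform saves a great deal of power, and given the PR effect the M1 has had for ARM, it will be easier to justify...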

Downsizing the Blog from t4g.small to t4g.micro

In "Moving the Blog to t4g.small" I mentioned moving this blog to a t4g.small on Amazon EC2 (2GB RAM + 20% CPU credit). After running it for a while, I pulled up the CPU usage:

I originally estimated needing about 20% CPU credit, but it turns out about 5% is enough. As for memory, it needs about 1GB; at that size you can see some idle processes getting pushed out to swap:

              total        used        free      shared  buff/cache   available
Mem:          952Mi       380Mi        79Mi       110Mi       492Mi       368Mi
Swap:         511Mi       152Mi       359Mi

Putting those conditions together, I moved down one tier to t4g.micro (1GB RAM + 10% CPU credit).
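The sizing decision above can be sketched as a tiny lookup that picks the smallest instance meeting both requirements. Only the two sizes and figures quoted in this post are included, and the function name `pick_instance` is made up for illustration.

```python
# (name, RAM in GB, baseline CPU credit in percent), smallest first.
# Figures are the ones quoted in this post.
candidates = [
    ("t4g.micro", 1.0, 10),
    ("t4g.small", 2.0, 20),
]


def pick_instance(cpu_percent_needed: float, ram_gb_needed: float) -> str:
    """Return the smallest candidate that covers both CPU and RAM needs."""
    for name, ram_gb, baseline_percent in candidates:
        if ram_gb >= ram_gb_needed and baseline_percent >= cpu_percent_needed:
            return name
    raise ValueError("no candidate is large enough")


# Observed needs: about 5% CPU and about 1GB of RAM.
print(pick_instance(cpu_percent_needed=5, ram_gb_needed=1.0))
```

With the original 20% CPU estimate the same function would have returned t4g.small, which matches the earlier choice.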

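Also, with a new instance type there is less worry about it being phased out soon, so I looked at Reserved Instances pricing: USD$44 for one year and USD$84 for three years. It pays off as long as I use it for two years, so I just bought the three-year term...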
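The "two years is enough" reasoning can be checked with a few lines of arithmetic. This sketch assumes the alternative is renewing a one-year RI each year at the same USD$44 price; the names are illustrative.

```python
one_year_usd = 44     # 1-year Reserved Instance, as quoted above
three_year_usd = 84   # 3-year Reserved Instance, as quoted above


def cost_with_one_year_terms(years_used: int) -> int:
    """Total cost when buying a 1-year RI for every year of use."""
    return one_year_usd * years_used


# First year count at which yearly renewals cost more than the 3-year term.
break_even_years = next(
    years for years in range(1, 4)
    if cost_with_one_year_terms(years) > three_year_usd
)

for years in (1, 2, 3):
    print(f"{years} year(s) of 1-year terms: ${cost_with_one_year_terms(years)}"
          f" vs. 3-year term: ${three_year_usd}")
print(f"3-year term is cheaper from year {break_even_years} onward")
print(f"monthly: ${one_year_usd / 12:.2f} (1-year) vs. "
      f"${three_year_usd / 36:.2f} (3-year)")
```

Two one-year terms cost USD$88, already more than the three-year price, which is why the break-even lands at two years.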

Moving the Blog to t4g.small

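I did the math and the cost is acceptable (instance + storage + traffic), so I moved the blog to a t4g.small (ARM) on AWS. In theory pages should load quite a bit faster. In a few days, once it proves stable, I'll buy an RI...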

In moving from x86-64 to ARM, the main change is the database: Percona Server currently doesn't provide an apt repository with ARM binaries, so I switched to MariaDB.

Everything else is about the same. The current Ubuntu + nginx + PHP setup has no real problems. I'll let it run for a while and see...

AWS Adds More Regions for T4g Instances

Earlier, in "AWS Launches ARM-based T-series Instances", I mentioned that Amazon EC2 had launched the ARM-based t4g.* instances. At the time only Tokyo and Mumbai were available in Asia; now they are live in more regions: "Announcing new Amazon EC2 T4g instances powered by AWS Graviton2 processors along with a T4g free trial in Asia Pacific (Sydney, Singapore), Europe (London), North Americas (Canada Central, San Francisco), and South Americas (Sao Paulo) regions".

I pulled the prices for Singapore:

Instance	vCPU	ECU	Memory	Storage	Price
t4g.nano	2	N/A	0.5 GiB	EBS Only	$0.0053 per Hour
t3.nano	2	Variable	0.5 GiB	EBS Only	$0.0066 per Hour
t3a.nano	2	Variable	0.5 GiB	EBS Only	$0.0059 per Hour

Time to test a few things and see how it goes...
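For a sense of scale, the sketch below turns the Singapore hourly prices above into rough monthly costs and shows how much t4g.nano saves against the other two. The 730 hours per month is an assumed average, and the names are illustrative.

```python
# Singapore on-demand prices quoted above, in USD per hour.
hourly = {
    "t4g.nano": 0.0053,
    "t3.nano": 0.0066,
    "t3a.nano": 0.0059,
}

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def t4g_saving_percent(other: str) -> float:
    """Percentage saved by running t4g.nano instead of `other`."""
    return (1 - hourly["t4g.nano"] / hourly[other]) * 100


for name, price in hourly.items():
    print(f"{name:<9} ${price * HOURS_PER_MONTH:.2f}/month, "
          f"t4g.nano saves {t4g_saving_percent(name):4.1f}%")
```

At these prices t4g.nano comes out about 20% cheaper than t3.nano and about 10% cheaper than t3a.nano.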

Raspberry Pi Launches Pico, an MCU Product

Raspberry Pi has launched a new product line, the Raspberry Pi Pico, an ARM-based MCU (microcontroller unit). There is plenty of discussion on Hacker News as well: "Raspberry Pi Pico and RP2040 Microcontroller (raspberrypi.org)".

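Compared with the Raspberry Pi Zero series: the Zero has 512MB RAM, which isn't large but can still run a full OS. The Pico, on the other hand, is positioned squarely as an MCU, so it is essentially a purpose-built system.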

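Looking at the Pico's specs, they are somewhat lower than those of the ESP32, but you can tell the Pico is mainly aimed at education: a fast dual-core CPU, many pins, large memory and flash (by the standards of typical mass-produced MCUs), and 5V power over Micro-USB for easy development and use.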

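Officially, C and MicroPython are the two supported ways to develop for it. Some people on Hacker News complain that MicroPython is painfully slow, but that shouldn't matter much: anyone running something more complex on it will move to C on their own.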

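Finally, the price: it is currently USD$4 each. That is of course well above typical commercial mass-produced MCUs (which get trimmed down to "just enough" specs, where prices can go as low as USD$0.x). But compared with the ESP32 that is popular in education right now (around USD$10), it carves out a separate product line. Add in the Raspberry Pi name recognition, and it should get more people familiar with MCUs...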

New Amazon EC2 Instance Types: R5b, D3 (D3en), C6gn, M5zn, G4ad

Besides yesterday's Mac mini announcement that led the way, Amazon EC2 has been rolling out updates on other instance types as well:

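Among the ones that interest me, the first is that the ARM-based instances now also have a 100Gbps "n" variant, c6gn, which looks well suited to high-traffic workloads. The first thing that comes to mind is self-hosted memcached?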

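Next is m5zn, which uses high-frequency Intel Xeon CPUs and targets programs that need single-core performance, though it sits under the m family rather than the c family...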

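Then there is g4ad, which uses AMD GPUs. AWS claims up to 45% better price performance compared with the NVIDIA-based g4dn, making it a showdown between Lisa Su and Jensen Huang: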

However, when compared to G4dn the new G4ad instances enable up to 45% better price performance for graphics-intensive workloads, including the aforementioned game streaming, remote graphics workstations, and rendering scenarios. Compared to an equally-sized G4dn instance, G4ad instances offer up to 40% improvement in performance.

It seems there is less ARM news than I expected...

Why the Apple M1 Is Fast and Power-Efficient

On Hacker News Daily I saw an explanation of why the Apple M1 is so fast and power-efficient; take it as one perspective:

1/ In case you were wondering: Apple's replacement for Intel processors turns out to work really, really well. Some otherwise skeptical techies are calling it "black magic". It runs Intel code extraordinarily well.

— Robᵉʳᵗ Graham?, provocateur (@ErrataRob) November 25, 2020

You can read it on Thread Reader: "Thread by @ErrataRob on Thread Reader App – Thread Reader App".

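It looks like Apple took x86 emulation into account from the planning stage, so it implemented the corresponding mode directly in the memory architecture, greatly reducing the problem Microsoft ran into on Surface: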

3/ The biggest hurdle was "memory-ordering", the order in which two CPUs see modifications in memory by each other. It's the biggest problem affecting Microsoft's emulation of x86 on their Arm-based "Surface" laptops.

4/ So Apple simply cheated. They added Intel's memory-ordering to their CPU. When running translated x86 code, they switch the mode of the CPU to conform to Intel's memory ordering.

Another interesting aspect is that the Apple M1 has two kinds of cores with different designs, one optimized for performance and the other for efficiency:

13/ Apple's strategy is to use two processors: one designed to run fast above 3 GHz, and the other to run slow below 2 GHz. Apple calls this their "performance" and "efficiency" processors. Each optimized to be their best at their goal.

The Wikipedia article also describes the differences between the two kinds of cores, such as the L1 instruction cache (192KB vs. 128KB) and power consumption:

The M1 has four high-performance "Firestorm" and four energy-efficient "Icestorm" cores, providing a configuration similar to ARM big.LITTLE and Intel's Lakefield processors. This combination allows power-use optimizations not possible with Apple–Intel architecture devices. Apple claims the energy-efficient cores use one tenth the power of the high-performance ones. The high-performance cores have 192 KB of instruction cache and 128 KB of data cache and share a 12 MB L2 cache; the energy-efficient cores have a 128 KB instruction cache, 64 KB data cache, and a shared 4 MB L2 cache. The Icestorm "E cluster" has a frequency of 0.6–2.064 GHz and a maximum power consumption of 1.3 W. The Firestorm "P cluster" has a frequency of 0.6–3.204 GHz and a maximum power consumption of 13.8 W.
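The cluster figures in the quote can be lined up against the "one tenth the power" claim with a quick calculation. This sketch uses only the maximum power and maximum frequency numbers quoted above; it says nothing about actual performance per watt, and the names are illustrative.

```python
# Cluster-level maximums from the Wikipedia quote above.
firestorm_max_watts = 13.8   # high-performance "P cluster"
icestorm_max_watts = 1.3     # energy-efficient "E cluster"
firestorm_max_ghz = 3.204
icestorm_max_ghz = 2.064

power_ratio = firestorm_max_watts / icestorm_max_watts
clock_ratio = firestorm_max_ghz / icestorm_max_ghz

print(f"P cluster vs. E cluster: x{power_ratio:.1f} power, "
      f"x{clock_ratio:.2f} clock")
```

A ratio of about 10.6 in maximum power is in line with the claim that the efficiency cores use roughly one tenth the power, while the peak clock differs by only about 1.55x.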

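Add in the other architectural improvements (such as instructions aimed at JavaScript, the larger L1, and TSMC's latest process), and the cumulative result is that it wipes the floor with the Intel version...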