DHH's series of posts on moving from the cloud back on-prem

With 37signals gradually moving all of its services from the cloud back to its own hardware, DHH has been writing a series of related posts since last year, starting with 「Why we're leaving the cloud」.

The most recent one, 「Cloud exit pays off in performance too」, has an interesting observation: today's hardware can get so big that a single machine can swallow the entire Basecamp service:

In fact, given that each of our new Dell R7625s have 196 vCPUs, we could actually run the entire Basecamp Classic application, including databases and Redis, on a single such machine! That's just astounding.

This also touches on another issue: their product never grew that big, and they never ran into a really large peak XD

That cloud is more expensive than on-prem is a well-known premise (factoring in network gear and storage, amortized hardware cost on-prem comes out to roughly 1/3 to 1/4 of the cloud, and the gap in bandwidth cost is even bigger), but even teams that know how to run a data center still go with the cloud for these reasons (advantages):

  • It saves the staffing cost of procurement.
  • It reassures the boss about "needing a lot of resources on short notice".

The first point is the better understood one: on-prem, if you suddenly need a batch of database-class machines, the data center usually doesn't have that class of hardware on hand, so you have to borrow machines from a vendor, and even the fastest install takes half a day, whereas in the cloud a few clicks in the console solve it.
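
The same convenience is scriptable too; a minimal sketch with boto3 (the AMI ID and instance type here are hypothetical placeholders, and credentials/region are assumed to already be configured):

    import boto3

    # Spin up a database-class machine on demand, something an on-prem
    # data center rarely has sitting around.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
        InstanceType="r6g.4xlarge",       # memory-optimized, database-class
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])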

The second point is the sudden-virality scenario, which almost never happens (i.e. a dream), but it is one of the key lines when pitching the cloud to the boss; and even when it does happen, you often find the code isn't written well enough to scale out, or even to scale up.

So 37signals moving everything back on-prem is indeed the right call, but the fact that a single machine can handle the whole thing means the volume never grew to where sharding was needed, which is a little sad... XD

AWS launches Graviton3 instance types

Amazon EC2 launched Graviton3-based instance types: 「New Graviton3-Based General Purpose (m7g) and Memory-Optimized (r7g) Amazon EC2 Instances」.

The first wave only has the general-purpose m7g and the memory-optimized r7g; the compute-optimized c7g already came out last May: 「AWS launches the c7g instance type」.

For now they are only offered in the US/European regions us-east-1, us-east-2, us-west-2, and eu-west-1; none of the Asian regions have them yet:

M7g and R7g instances are available today in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) AWS Regions in On-Demand, Spot, Reserved Instance, and Savings Plan form.

AWS claims 25% better performance than the Graviton2-based m6g & r6g, but checking prices in us-east-1, they also cost about 6% more; going by the official numbers, that works out to roughly an 18% improvement in price-performance, a nice gain for anyone actually running their CPUs at full load:

Today I am happy to tell you about the newest Amazon EC2 instance types, the M7g and the R7g. Both types are powered by the latest generation AWS Graviton3 processors, and are designed to deliver up to 25% better performance than the equivalent sixth-generation (M6g and R6g) instances, making them the best performers in EC2.
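
The arithmetic behind that ~18% figure, as a quick back-of-the-envelope check (using the claimed 25% uplift and the ~6% price difference observed above):

    # Price-performance: 25% more performance at ~6% higher price
    perf_gain = 1.25    # claimed Graviton3 vs Graviton2 performance
    price_ratio = 1.06  # observed m7g vs m6g price in us-east-1

    perf_per_dollar = perf_gain / price_ratio
    print(f"{(perf_per_dollar - 1) * 100:.1f}% better price-performance")
    # -> 17.9% better price-performance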

One big architectural change mentioned for Graviton3 is the move from DDR4 to DDR5 memory, which raises memory bandwidth by 50%:

Both types of instances are equipped with DDR5 memory, which provides up to 50% higher memory bandwidth than the DDR4 memory used in previous generations.

Next is watching whether there are plans to bring this down to the t family, something like a t5g; if so I may give it a try, though the machine running this blog is already on a three-year RI, and by the time that expires there may well be a Graviton4 or Graviton5...

Fan exhaust grills: how different patterns affect noise

An article I saw on Hacker News Daily a few days ago: 「Effects of grill patterns on fan performance/noise (2011) (pugetsystems.com)」; the original is 「Effects of Grill Patterns on Fan Performance/Noise」. It covers how the grill pattern over a computer fan's exhaust affects noise. It's an old article from 2011, but there doesn't seem to be anything newer on the topic...

Exhaust grills appear to be a legal requirement in some places, mainly to keep children's fingers out and to protect adults from accidental contact, so there are dedicated test procedures for this:

I remember back in 2000 ish I worked in R&D of a PC manufacturer and had to check the new PSUs and cases with a 'Test Finger' to make sure all the holes were small enough. The test finger was really expensive IIRC.

Back to the article's main point, the differences in noise and airflow: the baseline is measured with the grill removed (establishing the lowest noise and the highest airflow), and then each grill is measured against it.

Mesh adds the least noise, but airflow suffers a bit.

Wire also adds very little noise, and achieves the highest airflow.

From the other end of the ranking, Turbine is both the noisiest and moves the least air.

It turns out the worse grills add quite a lot of noise, converting much of the air's kinetic energy into sound...

The cost of communication between CPU cores

Saw 「Measuring CPU core-to-core latency (github.com/nviennot)」 on Hacker News; the project itself is at 「Measuring CPU core-to-core latency」. It's an interesting study measuring the cost of cross-core communication inside many different CPUs.

According to the project's description, the measurement works through the cache coherence protocol:

We measure the latency it takes for a CPU to send a message to another CPU via its cache coherence protocol.

By pinning two threads on two different CPU cores, we can get them to do a bunch of compare-exchange operation, and measure the latency.
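
A rough sketch of the same idea in Python (the project itself does this with compare-exchange operations in Rust; Python's interpreter overhead will dominate the measured numbers, and os.sched_setaffinity is Linux-only, so treat this purely as an illustration of the ping-pong technique):

    import os
    import time
    from multiprocessing import Process, Value

    ROUNDS = 100_000

    def pong(flag, core):
        os.sched_setaffinity(0, {core})  # pin this process to one core
        for _ in range(ROUNDS):
            while flag.value != 1:       # wait for the ping
                pass
            flag.value = 0               # reply with the pong

    def ping(flag, core):
        os.sched_setaffinity(0, {core})
        start = time.perf_counter()
        for _ in range(ROUNDS):
            flag.value = 1               # send the ping
            while flag.value != 0:       # wait for the pong
                pass
        elapsed = time.perf_counter() - start
        # one round trip = two one-way hops between the two cores
        print(f"~{elapsed / ROUNDS / 2 * 1e9:.0f} ns per one-way hop")

    if __name__ == "__main__":
        flag = Value("i", 0, lock=False)          # raw shared int, no lock
        p = Process(target=pong, args=(flag, 1))  # pinned to core 1
        p.start()
        ping(flag, 0)                             # pinned to core 0
        p.join()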

A lot of CPUs have already been measured, and there are some interesting results.

For instance, the first chart, 「Intel Core i9-12900K @ 8+8 Cores (Alder Lake, 12th gen) 2021-Q4」, had people quite curious about what is going on with CPU #8, whose cross-core latency is noticeably lower; some even dug up die shots of the CPU to look for an explanation.

Another is c6a.metal on AWS, 「AMD EPYC 7R13 @ 48 Cores (Milan, 3rd gen) 2021-Q1」, where the cores are visibly split into six blocks.

Moving on to ARM, the higher-core-count c7g.16xlarge, 「AWS Graviton3 @ 64 Cores (Arm Neoverse, 3rd gen) 2021-Q4」, shows even more unevenness.

The earlier c6gd.metal is also a 64-core ARM machine, 「AWS Graviton2 @ 64 Cores (Arm Neoverse, 2nd gen) 2020-Q1」, yet shows a very different latency pattern.

The general takeaway is that the more cores there are, the more technical bottlenecks show up, making communication costs uneven across cores... It feels a lot like what came up when first learning about NUMA.

Backblaze's drive stats report for 2022 Q2

Backblaze published its drive report for 2022 Q2: 「Backblaze Drive Stats for Q2 2022」.

The first table is the aggregate data, listing every model still in service from 2013 to now.

There seem to be some errors in the table; the numbers for Seagate's ST12000NM0007 don't add up: 1,288 drives but 1,989 failures, with a cumulative 35,823,850 drive days, which averages out to 27,813 days per drive (76 years?), while the AFR is only 2.03% (low 1.90%, high 2.10%).

Even if the drive count were off by a zero (12,880 drives) it still wouldn't work out, and the datasheet suggests this model shipped in 2017, barely five years ago; dividing those 76 years by 10 gives 7.6 years, which doesn't fit either. Let's see whether this gets corrected later...
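
For reference, a quick sanity check of those numbers against the usual AFR definition (failures per drive-year, assuming the standard drive-days/365 formula) shows the AFR is consistent with the drive-days figure, so it's the drive count that looks wrong:

    # AFR = failures / (drive_days / 365) * 100
    drives = 1288
    failures = 1989
    drive_days = 35_823_850

    drive_years = drive_days / 365
    print(f"AFR: {failures / drive_years * 100:.2f}%")    # ~2.03%, matches the table
    print(f"avg days/drive: {drive_days // drives:,}")    # 27,813 days, i.e. ~76 years (!)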

The author also pulls out the more indicative models, and you can see HGST has done very well historically.

Even filtering down to just the 2022 Q2 failure data, HGST's recent AFR still looks quite good.

Finally, he mentions he will be speaking at DEFCON 30:

If you will be at DEFCON 30 in Las Vegas, I will be speaking live at the Data Duplication Village (DDV) at 1 p.m. on Friday, August 12th. The all-volunteer DDV is located in the lower level of the executive conference center of the Flamingo hotel. We’ll be talking about Drive Stats, SSDs, drive life expectancy, SMART stats, and more. I hope to see you there.

The root cause of Hacker News's long outage a few days ago

Hacker News was down for a long time a few days ago. Going by the timeline on Twitter, it started with a tweet at 2022/07/08 14:08 UTC.

A restore attempt failed along the way (2022/07/08 17:35 UTC).

It finally came back at 2022/07/08 20:48 UTC.

The Twitter timeline works out to a bit over six hours, which is quite long for a site that presumably only needed its database restored, so the follow-up discussion in 「HN is up again」 went into the cause; HN's boss dang also noted there that the downtime was over seven hours:

8 hours of downtime, but not data loss, since there was no data to lose during the downtime.

Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565

First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)

So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.

There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.

I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.

People then went looking for dang's replies (the first-hand account, after all); a quick Ctrl-F turns up an interesting guess. Starting from node 32028511 there is a fun subthread; first, mikeiem:

You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5years. I am not even kidding.

Then kabdib's reply:

Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).
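
That oddly precise "4 years, 206 days and 16 hours" is just 40,000 hours unpacked, using 365-day years:

    # 40,000 hours expressed as years/days/hours (365-day years)
    hours = 40_000
    days, hours_left = divmod(hours, 24)   # 1666 days, 16 hours
    years, days_left = divmod(days, 365)   # 4 years, 206 days
    print(f"{years} years, {days_left} days and {hours_left} hours")
    # -> 4 years, 206 days and 16 hours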

And then dang says this guess may well have nailed it:

Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

He later also dug through their own /newest for related stories; judging by the keywords, these were HPE SSDs:

It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)

So this downtime looks very much like that SSD firmware bug. For now the site appears to have been moved onto EC2:

$ host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
$ host 50.112.136.166      
166.136.112.50.in-addr.arpa domain name pointer ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
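
The same check in Python, if you'd rather not shell out to host (a forward lookup followed by a reverse lookup, both via the standard library):

    import socket

    # Forward lookup: where does news.ycombinator.com point right now?
    addr = socket.gethostbyname("news.ycombinator.com")
    print(addr)

    # Reverse lookup: the PTR record gives away the EC2 hostname
    name, _aliases, _addrs = socket.gethostbyaddr(addr)
    print(name)  # e.g. ec2-50-112-136-166.us-west-2.compute.amazonaws.com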

From the discussion thread, this looks like a temporary arrangement?

Amazon EC2 now has Mac (M1) instances for rent

At the end of 2020, AWS launched Mac (Intel) instances built out of Mac minis: 「Amazon EC2 launches Mac instances」, and at the time an M1 version was planned for 2021:

Apple M1 Chip – EC2 Mac instances with the Apple M1 chip are already in the works, and planned for 2021.

That slipped, to no one's surprise, and now the M1 version is out: 「New – Amazon EC2 M1 Mac Instances」.

From the description it is still a Mac mini, hooked up to the AWS Nitro System:

EC2 Mac instances are dedicated Mac mini computers attached through Thunderbolt to the AWS Nitro System, which lets the Mac mini appear and behave like another EC2 instance.

And like the Intel version, because it is billed through Dedicated Hosts, billing is per second but with a minimum 24-hour allocation period:

Amazon EC2 Mac instances are available as Dedicated Hosts through both On Demand and Savings Plans pricing models. The Dedicated Host is the unit of billing for EC2 Mac instances. Billing is per second, with a 24-hour minimum allocation period for the Dedicated Host to comply with the Apple macOS Software License Agreement. At the end of the 24-hour minimum allocation period, the host can be released at any time with no further commitment.

The Intel version's codename is mac1 with a single size, mac1.metal; the M1 version is mac2, also with a single size, mac2.metal.

Taking the classic us-east-1 as the reference, mac1.metal's on-demand price is US$1.083/hour while mac2.metal is US$0.65/hour, roughly 60% of the price; that's quite a bit cheaper, probably reflecting the hardware amortization and power costs.
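
The ratio, plus what the 24-hour minimum allocation actually commits you to, using the us-east-1 on-demand numbers above:

    mac1_hourly = 1.083  # mac1.metal (Intel), us-east-1 on-demand, USD
    mac2_hourly = 0.65   # mac2.metal (M1), us-east-1 on-demand, USD

    print(f"mac2 / mac1: {mac2_hourly / mac1_hourly:.0%}")        # 60%
    print(f"24h minimum on mac2.metal: ${24 * mac2_hourly:.2f}")  # $15.60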

On top of that, given everyone's experience with M1 so far, Rosetta 2 isn't necessarily much slower than running natively; mac1.metal offers 12 vCPUs against mac2.metal's 8 cores, but for cloud workloads that absolutely must run on a Mac, the obvious use case is still CI-type jobs tied to the Apple ecosystem?

For now the main problem remains the 24-hour minimum billing unit, which takes away a lot of the flexibility...

Amazon EC2 supports NitroTPM and UEFI Secure Boot

Another announcement found while catching up on my RSS reader: two weeks ago AWS announced Amazon EC2 support for NitroTPM and UEFI Secure Boot: 「Amazon EC2 Now Supports NitroTPM and UEFI Secure Boot」.

NitroTPM is compatible with the TPM 2.0 interface, so any software that supports TPM 2.0 can use it (such as Windows 11):

Nitro Trusted Platform Module (NitroTPM) is a virtual device that is provided by the AWS Nitro System and conforms to the TPM 2.0 specification.

I also ran into TPM-related material earlier while looking into LUKS:

Linux Unified Key Setup (LUKS) or dm-verity on Linux are examples of OS-level applications that can leverage NitroTPM too.
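
A minimal sketch for confirming the guest actually sees a TPM device before, say, enrolling a LUKS key against it (standard Linux sysfs paths, nothing AWS-specific; the version attribute is an assumption that only holds on reasonably recent kernels):

    from pathlib import Path

    # On Linux, an exposed TPM shows up under /sys/class/tpm/
    tpm = Path("/sys/class/tpm/tpm0")
    if tpm.exists():
        ver_file = tpm / "tpm_version_major"  # reads "2" for a TPM 2.0 device
        if ver_file.exists():
            print(f"TPM present, major version {ver_file.read_text().strip()}")
        else:
            print("TPM present")
    else:
        print("no TPM device exposed to this instance")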

There are some platform restrictions: only Intel and AMD platforms are supported, and even then the Xen-based, Mac, and bare-metal instance types are excluded:

At the moment, we support all Intel and AMD instance types that supports UEFI boot mode. Graviton1, Graviton2, Xen-based, Mac, and bare-metal instances are not supported.

ARM has its own scheme, so not playing along with TPM is understandable; Xen is presumably unsupported because it no longer gets new features; Mac is likely an Apple hardware limitation; and the bare-metal exclusion is presumably because there is no virtualization layer to provide the device?

The feature costs nothing extra, and the rollout looks close to global from day one:

There is no additional cost for using NitroTPM. It is available today in all AWS Regions, including the AWS GovCloud (US) Regions, except in China.

AWS launches the c7g instance type

AWS launched the next generation of ARM products in the Amazon EC2 line, the AWS Graviton3-based c7g family: 「New – Amazon EC2 C7g Instances, Powered by AWS Graviton3 Processors」.

Graviton3 claims 25% more general performance than Graviton2, double the floating-point performance, plus the bandwidth advantage of DDR5:

Our next generation, Graviton3 processors, deliver up to 25 percent higher performance, up to 2x higher floating-point performance, and 50 percent faster memory access based on leading-edge DDR5 memory technology compared with Graviton2 processors.

「FreeBSD on the Graviton 3」 also has some performance numbers (albeit measured on FreeBSD), and they basically match AWS's claims.

Only us-east-1 and us-west-2 have the type for now, but the Graviton line has always been one of AWS's strong pushes, so other regions should get it soon:

C7g instances are initially available in US East (N. Virginia) and US West (Oregon) AWS Regions; other Regions will be added shortly after launch.

Price-wise it is a bit above c6g: in us-east-1, c7g.16xlarge (64 vCPU + 128 GB RAM) runs US$2.32/hour versus US$2.176/hour for c6g.16xlarge, about 6.6% more; for CPU-bound workloads it should still be well worth it.
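
The same back-of-the-envelope as for m7g/r7g, setting the ~6.6% premium against the claimed 25% uplift:

    c7g = 2.32    # c7g.16xlarge, us-east-1 on-demand, USD/hour
    c6g = 2.176   # c6g.16xlarge, us-east-1 on-demand, USD/hour

    premium = c7g / c6g - 1
    print(f"price premium: {premium:.1%}")                            # 6.6%
    print(f"price-performance gain: {1.25 / (1 + premium) - 1:.1%}")  # ~17.2%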

The other thing now is waiting for the follow-up m7g and r7g families to show up...

Some interesting finds in the Hacker News discussion of Microsoft Teams

Saw the thread 「Ask HN: Why is MS Teams so slow, do devs test Teams on less powerful machines?」 on Hacker News, where the original poster brings up how extremely resource-hungry Teams is:

I have a laptop with an i5 processor and 8G of RAM. Hard drive is an SSD. It sometimes takes me a full minute and a half to get Teams open and ready to join a meeting. It is driving me crazy.

Predictably the thread is full of complaints, but there is an unexpectedly interesting reply:

Ex office dev here. Actual dev work is done on i9 workstations running 64 gb of ram, and usually located very near an Azure data center, regardless of where the dev works. The result is that it's fast for us.

Everyone knows that it runs like poop, but there are other priorities, and no performance regression tests.

So apparently all you need is a brutally fast machine + an office close enough to the data center and it will be fast XDDD

And indeed, as the top comment points out, the developers all use MBPs, so for them it at least runs (because the machines are fast enough), while every other team just keeps cursing at it:

At my company, the developers are all on fairly powerful MacBook Pros, and everyone else in the company has Windows laptops (I think generally Surface devices)

For the developers, Teams works... as good as Teams can. So not great, but it works most of the time (for me, anyway). For everyone else though, I hear nothing but issues. Constantly having to restart to make Teams work. And again, this is on Surface devices, so Microsoft is making the app, the OS, and the hardware!