Backblaze 的 SSD failure rate 資料

Backblaze 整理了 SSD failure rate 的資料:「The SSD Edition: 2022 Drive Stats Review」,裡面比較有興趣的是歷史資料這部份:

SSD 用在系統碟的關係,數量沒像 HDD 那麼多,所以有些信心區間的值會差異很大。

裡面比較亮眼的是 DELLBOSS VD,用的數量不算少,而且看平均使用時間應該是比 MX500 多了一倍多,但都還沒有掛掉的記錄,所以 failure rate 就算是信心區間的上限值都還是很漂亮。

然後用最多得是 Seagate 的 SSD,平均使用時間又比 Dell 那批更長了。

Backblaze 對 SSD 存活率的報告

Backblaze 發了一篇對 SSD 存活率的報告:「The SSD Edition: 2022 Drive Stats Mid-year Review」,報告分成兩大塊,一塊是單講 SSD 的,另外一塊是跟傳統磁頭硬碟 HDD 比較。

首先是這張總表,從 2018 年到現在的 SSD 硬碟的 AFR 資料:

可以看到有特地標出信賴區間,因為對於某些量真的太少的型號,算出來的 AFR 沒有太大意義,所以重點只有在幾個數量比較多的型號。

Seagate 的 ZA250CM10003 最多,AFR 是 0.70% (CI 在 0.3%-1.3%);接下來是 Seagate 的 ZA250CM10002,AFR 是 0.78% (CI 在 0.4%-1.4%)。

第三多的是 Dell 的 DELLBOSS VD,AFR 是 0% (CI 在 0.0%-0.8%),不過要注意這是 M.2 界面,而且是 server 等級:

It is a server-class drive in an M.2 form factor, but it might be out of the price range for many of us as it currently sells from Dell for $468.65.

接下來是比較 SSD 與 HDD。這邊的比較中,兩者都是相同的用途 (開機碟 & 系統碟):

The SSDs and HDDs we are reporting on are all boot drives. They perform the same functions: booting the storage servers, recording log files, acting as temporary storage for SMART stats, and so on.

因為 SSD 目前只有五年的記錄,可以看到如果只比較五年的話,SSD 的 AFR 是比 HDD 好上不少的:

不過這邊還是以機房環境來說明,像是機櫃的振動與使用的 pattern 都是可以想到的因素。一般的情況下,如果沒有一堆 HDD 在 JBOD 裡面振動的話,是不是可以活比較久就不知道了...

但現在開機碟用 HDD 應該也會開到天花地老,好像也沒有什麼特別的理由會換回 HDD...

Hacker News 前幾天炸很久的 root cause

前幾天 Hacker News 炸了很久,如果是從 Twitter 上的資料來看,是從 2022/07/08 14:08 UTC 這篇:

中間還原失敗 (2022/07/08 17:35 UTC):

到最後恢復 (2022/07/08 20:48 UTC):

Twitter 這邊的資料看起來差不多是六個小時多,以一個應該是只有 database 需要還原的站台來說的確是蠻久的,所以後續在「HN is up again」這邊就有在討論原因,裡面 HN 的老大 dang 也有提到 downtime 是七個小時多:

8 hours of downtime, but not data loss, since there was no data to lose during the downtime.

Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565

First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)

So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.

There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.

I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.

另外大家就在找 dang 的回應是什麼 (畢竟是第一手資料),用 Ctrl-F 找一下就看到有趣的猜測,從 32028511 這個節點可以看到這串有趣的討論,首先是 mikeiem

You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5years. I am not even kidding.

然後是 kabdib 的回應:

Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).

接著就是 dang 說他覺得這個猜測很有可能:

Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

後續他也從自家的 /newest 裡面撈了相關的資料出來,依照他撈出來的關鍵字,看起來是用 HPE 出的 SSD:

It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)

這次 downtime 看起來很像是中了 SSD firmware bug,目前看起來先搬到 EC2 上面了:

$ host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
$ host 50.112.136.166      
166.136.112.50.in-addr.arpa domain name pointer ec2-50-112-136-166.us-west-2.compute.amazonaws.com.

看討論串應該是暫時性的?

用 Ephemeral Storage 加速 MySQL over ZFS 的效能

Percona 的「MySQL/ZFS in the Cloud, Leveraging Ephemeral Storage」這篇裡面在探討是不是可以看看 ZFS 在 Ephemeral Storage (機器附的本地硬碟) 上的效能。

一開始測試是直接當主力硬碟來測,可以看到跑 ZFS 的情況下,本地的 storage 還是會比 SSD Premium (這是 Azure 的產品線) 還快不少:

但把資料放在本地的 storage 上其實有點刺激,至少在 production 應該不太會這樣搞,所以後面用 L2ARC 的方式來測,可以看到效率提昇相當明顯,甚至接近本來直接把資料放在本地的 storage:

另外測了 ext4/bcache,看起來效率就沒那麼好:

這樣看起來是個不錯的選擇...

Backblaze 的 2021Q1 硬碟報告

Backblaze 昨天放出來 2021Q1 的硬碟報告:「Backblaze Drive Stats for Q1 2021」。

前半部沒有什麼意外,HGST 的硬碟比起其他家的看起來還是好不少。

比較有趣的是首次拿 SSD 與 HDD 對決,這邊比較的對象是開機碟。可以看到如果以 2021Q1 的時間來看,SSD 的 AFR 低不少:

拉長到 lifetime 來看也還是低不少:

但裡面也有提到 HDD 的最大壽命比目前 SSD 都高不少,時間看起來可能還不夠長,算是一個很初步的資料...

用 SSD 的 I/O 暴力解

這篇「Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation」用了 AMD 平台上的 PCI-e 4.0 硬幹出 4K 隨機讀取 11M IOPS 的速度,另外在大區塊讀取可以到 66GB/sec,後面這個速度應該是可以把 DDR4 記憶體頻寬吃滿...

硬體的部份,作者用了 8*1TB + 2*500GB 的 M.2 SSD 來建這組系統,然後接到卡上:

不過他好像沒提到這組機器的價錢 (雖然每個單品都查的到),大概算了一下 storage 的部份其實不怎麼貴,Samsung 980 Pro PCIe 4.0 M.2 SSD 的部份,1TB 每一條要 USD$160,500GB 要 USD$120,旁邊那些 CPU 與記憶體反而貴不少... 不過整台機器應該有機會在 USD$10000 搞定?

感覺拿來給剪片的人用很爽?至少在處理讀寫的時候應該是很順...

Amazon EBS 的 gp3 可以用在開機磁碟了

可以先參考「Amazon EBS 推出了 gp3」這篇,但剛出來的時候大家都有發現無論是透過 web console 還是透過 awscli,boot disk 都沒辦法改成 gp3,可是在官方的文件上又說可以用 gp3,所以就有人在 AWS 的 forum 上發問了:「EBS GP3 Boot Volume Issues」。

直到剛剛發現已經可以改成 gp3 了... 一個一個手動改當然也是 OK,但對於有一卡車 EBS 要換的人來說鐵定得弄指令來換,這邊搭配了 jq 一起改:

aws ec2 describe-volumes | jq '.Volumes[] | select(.VolumeType == "gp2") | .VolumeId' | xargs -n1 -P4 env aws ec2 modify-volume --volume-type gp3 --volume-id

這邊是把 gp2 都改成 gp3,沒有考慮到空間大小的問題 (因為超過 1TB 時 gp2 給的 IOPS 會比較多),另外 -P4 是平行四個 process 跑,改起來會快一些...

Amazon EBS 的 io2 給了不少新消息...

Amazon EBS 的另外一個新推出的東西,是針對 io2 的改善:

前面兩則消息可以一起看,主要是推出了 EBS Block Express,有著效能上的提昇:

Built on our new EBS Block Express architecture that takes advantage of some advanced communication protocols implemented as part of the AWS Nitro System, the volumes will give you up to 256K IOPS & 4000 MBps of throughput and a maximum volume size of 64 TiB, all with sub-millisecond, low-variance I/O latency. Throughput scales proportionally at 0.256 MB/second per provisioned IOPS, up to a maximum of 4000 MBps per volume. You can provision 1000 IOPS per GiB of storage, twice as many as before. The increased volume size & higher throughput means that you will no longer need to stripe multiple EBS volumes together, reducing complexity and management overhead.

目前因為是 preview 階段,想要用的人需要申請測試。要注意目前支援的區域有限 (不像這次推出 gp3 的時候就是全區),而且需要搭配 r5b 的機器:

The preview is currently available in the US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), and Europe (Frankfurt) Regions. During the preview, we support the use of R5b instances, with support for other Nitro-powered instances in the works.

第三則消息則是在講 io2 的 IOPS 的折扣,針對購買 32K IOPS 以上的部份會有 30% 折扣:

Now, with the new tiered pricing structure, the first 32,000 IOPS provisioned on a volume are charged at the current base rate ($0.065 per provisioned IOPS-mo) and the second tier between 32,001 and 64,000 is charged at a 30% lower rate ($0.046 per provisioned IOPS-mo).

針對前面提到的 preview 版本 (EBS Block Express),因為可以超過 64K IOPS,這個部份的價錢會更低,再疊一次 30% 的折扣:

Furthermore, for customers who have even higher performance requirement than currently supported by a single io2 volume today, we are previewing io2 volumes that run on EBS Block Express, the next generation of our block storage architecture. io2 Block Express volumes can be provisioned to deliver peak IOPS of 256,000. For these volume, any IOPS provisioned over 64,000 IOPS will be charged at a further 30% lower rate than the second tier ($0.032 per provisioned IOP-mo for IOPS over 64,000). This lowers the effective rate to $0.038 per provisioned IOPS on a volume provisioned with 256,000 IOPS.

算是要衝效能的人用的,目前平常應該還是會用 gp2 或是 gp3 的 SSD...

Amazon EBS 推出了 gp3

今年的 AWS re:Invent 又開始了,不過因為疫情的關係,這次是線上為主... 這邊先來整理一下 Amazon EBS 相關的更新。

首先是推出了新的 gp3 類型,也是 SSD 類:「New – Amazon EBS gp3 Volume Lets You Provision Performance Apart From Capacity」。

每 GB 單位成本比 gp2 低 20%:

Today I would like to tell you about gp3, a new type of SSD EBS volume that lets you provision performance independent of storage capacity, and offers a 20% lower price than existing gp2 volume types.

然後直接給你 3000 IOPS 與 125MB/sec,有需要更高的話可以「加購」:

gp3 is designed to provide predictable 3,000 IOPS baseline performance and 125 MiB/s regardless of volume size. It is ideal for applications that require high performance at a low cost such as MySQL, Cassandra, virtual desktops and Hadoop analytics. Customers looking for higher performance can scale up to 16,000 IOPS and 1,000 MiB/s for an additional fee. The top performance of gp3 is 4 times faster than max throughput of gp2 volumes.

但照「Amazon EBS volume types」這邊的列表可以看到,要注意 gp2 可以 burst 的 throughput (250MB/sec) 比 gp3 的 baseline (125MB/sec) 高。

也因為這樣,可以把一些 random access 比較多的 /data 這類的 EBS 換過去,但如果是要大量 sequential access 的也許就不適合了。

IOPS 的部份,1TB 以下的 gp2 換過去應該是沒什麼太大問題,因為在 gp2 的時候是 1GB 給 3IOPS,所以 1TB 以下的 gp2 都低於 3000IOPS。

轉移的部份可以在 AWS 的 console 上直接 migrate 到 gp3

If you’re currently using gp2, you can easily migrate your EBS volumes to gp3 using Amazon EBS Elastic Volumes, an existing feature of Amazon EBS. Elastic Volumes allows you to modify the volume type, IOPS, and throughput of your existing EBS volumes without interrupting your Amazon EC2 instances.

像是這樣:

但照「Amazon EBS volume types」這邊的列表,gp3 可以是開機硬碟,但是改不過去啊 XDDD

Update:剛剛發現文件被修正了,看起來不能當開機硬碟...

不知道哪邊搞錯了,過幾天看看吧 XDDD

在 Hacker News 上看到 Raspberry Pi 400 使用心得

Hacker News 看到 Raspberry Pi 400 的使用心得:「I've now played with a Raspberry Pi 400 for a week and here are my conclusions」,先前在「Raspberry Pi 400」這邊有提到 Raspberry Pi 400,主要就是一台 Raspberry Pi 4 Model B 的主機,但跟鍵盤整合在一起。

在文章裡提到了 Raspberry Pi 4 可以 USB Boot 後帶來的改變 (參考之前寫的「Raspberry Pi 4 可以透過 USB 開機了」這篇),主要是透過 USB3 外接硬碟可以讓讀寫速度大幅提昇 (尤其是 SSD),這一直都是 Raspberry Pi 上面用 SD card 的問題,看起來唯一的問題還是 CPU 的速度還是沒有像目前常見的 x86-64 強。

If you give it fast enough "disk" storage it really moves. I plugged in a Kingston brand 120GB SSD on a USB3 adapter. hdparm -t gave 292MB/s read speed and the default LXDE environment was really crisply responsive, with even a first launch of Chromium taking less than two seconds. With such good storage, the only real limitation is that heavy Javascript stuff is too slow - 5+ seconds to switch between folders in Chrome, or for the thumbnail gallery to appear in Youtube. Also, video calling is marginal. Aside from that the CPU is fast enough.

另外討論裡面也有人希望 Raspberry Pi 考慮引入 eMMC 或是提供 M.2 界面改善讀寫速度,不過我覺得 SD card 的設計算是 Raspberry Pi 當初的方向,本來就有取捨,不太可能什麼都做進去...

回到作者的心得,雖然 USB3 轉 SSD 看起來 i/o 速度快不少,但我好像主要不是遇到 i/o 速度問題,反倒是最近 chromium 的硬體解碼好像有些進度,也許看影片有機會用硬體處理 (至少一部份?),希望至少可以輕鬆看 1080p60 啊...