Hacker News 前幾天炸很久的 root cause

前幾天 Hacker News 炸了很久,如果是從 Twitter 上的資料來看,是從 2022/07/08 14:08 UTC 這篇:

中間還原失敗 (2022/07/08 17:35 UTC):

到最後恢復 (2022/07/08 20:48 UTC):

Twitter 這邊的資料看起來差不多是六個小時多,以一個應該是只有 database 需要還原的站台來說的確是蠻久的,所以後續在「HN is up again」這邊就有在討論原因,裡面 HN 的老大 dang 也有提到 downtime 是七個小時多:

8 hours of downtime, but not data loss, since there was no data to lose during the downtime.

Last post before we went down (2022-07-08 12:46:04 UTC): https://news.ycombinator.com/item?id=32026565

First post once we were back up (2022-07-08 20:30:55 UTC): https://news.ycombinator.com/item?id=32026571 (hey, that's this thread! how'd you do that, tpmx?)

So, 7h 45m of downtime. What we don't know is how many posts (or votes, etc.) happened after our last backup, and were therefore lost. The latest vote we have was at 2022-07-08 12:46:05 UTC, which is about the same as the last post.

There can't be many lost posts or votes, though, because I checked HN Search (https://hn.algolia.com/) just before we brought HN back up, and their most recent comment and story were behind ours. That means our last backup on the ill-fated server was taken after the last API update (HN Search relies on our API), and the API gets updated every 30 seconds.

I'm not saying that's a rock-solid argument, but it suggests that 30 seconds is an upper bound on how much data we lost.

另外大家就在找 dang 的回應是什麼 (畢竟是第一手資料),用 Ctrl-F 找一下就看到有趣的猜測,從 32028511 這個節點可以看到這串有趣的討論,首先是 mikeiem

You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5years. I am not even kidding.

然後是 kabdib 的回應:

Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).

接著就是 dang 說他覺得這個猜測很有可能:

Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

後續他也從自家的 /newest 裡面撈了相關的資料出來,依照他撈出來的關鍵字,看起來是用 HPE 出的 SSD:

It's also an example of the dharma of /newest – the rising and falling away of stories that get no attention:

HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)

HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)

Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)

HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)

HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)

(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)

這次 downtime 看起來很像是中了 SSD firmware bug,目前看起來先搬到 EC2 上面了:

$ host news.ycombinator.com
news.ycombinator.com has address 50.112.136.166
$ host 50.112.136.166      
166.136.112.50.in-addr.arpa domain name pointer ec2-50-112-136-166.us-west-2.compute.amazonaws.com.

看討論串應該是暫時性的?

HP 印表機的 Port 與 Prometheus...

Twitter 上看到這個,HP 印表機的 Port 9100 跟 Prometheus 撞到,再加上 mistype,於是就出事了:

找了一下 HP 的文件,「HP Jetdirect Print Servers - HP Jetdirect Port Numbers for TCP/IP (UDP) Connections」:

9100 TCP port is used for printing. Port numbers 9101 and 9102 are for parallel ports 2 and 3 on the three-port HP Jetdirect external print servers.

翻了一下「Service Name and Transport Protocol Port Number Registry」這邊,看起來 HP 在很久前就登記了 9100/tcp 與 9100/udp...

不過這沒有誰對誰錯的問題,只是很好笑:Printer 在收到不認識的指令時會直接當做 text 印出來,加上 Prometheus 的 HTTP request 打進去...

Intel CPU + AMD GPU 合一的的系統

先前就有看到 Intel 要與 AMD 合作,將 Intel CPU + AMD GPU 整合在一起以對抗 Nvidia,現在看到 HP 推出對應的筆電了:「HP’s new 15-inch Spectre x360 uses the hybrid Intel/AMD processor」。

不過名字剛好跟最近的安全漏洞撞到了 XDDD (所以才想寫 XDDD)

The new Spectre x360 15 is one of the first systems to be announced that uses the new Kaby Lake-G processors from Intel. These processors combine an Intel CPU (with its own integrated GPU) with an AMD GPU, all within a single package.


出自「Kaby Lake-G unveiled: Intel CPU, AMD GPU, Nvidia-beating performance」。

這種合作的仗打不打的動呢... 不怎麼看好就是了 :o

這次換 HP 裝 Spyware 啦~

討論的頗熱烈的,像是「HP is installing spyware on its machines disguised as an "analytics client"」、「HP stealthily installs new spyware called HP Touchpoint Analytics Client」。

這個軟體會被注意到是因為吃太多資源,而且使用者沒有同意安裝這個軟體 (目前看是起來是透過自動更新機制裝進去的):「Didn't Install HP TouchpointAnalyticsClient and It's Causing CPU 95-98 Red」(先備份一份在這邊,以免被砍...)。

然後這軟體很明顯會傳資料回 HP:

The HP Touchpoint Manager technology is now being delivered as a part of HP Device as a Service (DaaS) Analytics and Proactive Management capabilities. Therefore, HP is discontinuing the self-managed HP Touchpoint Manager solution.

先前聯想因為類似的行為賠了 350 萬美金,這次 HP 搞這包不知道會怎麼樣...

HP 的 audio driver 內含 Keylogger

HP 被發現 2015 年簽出來的 audio driver 內含 keylogger (yeah,因為有數位簽名,所以賴不掉...):「[EN] Keylogger in Hewlett-Packard Audio Driver」。完整的報告在「modzero Security Advisory: Unintended/Covert Storage Channel for sensitive data in Conexant HD Audio Driver Package. [MZ-17-01]」這邊。

keylogger 記錄後會寫到 local file 裡。來拉板凳...

Etsy 用 SSD 的故事

EtsyLaurie Denness 對於 Etsy 使用各種品牌 SSD 的情況給出了他的經歷:「SSDs: A gift and a curse」。

重點在於開頭說的:

SSD firmware is buggy

可以看到當 SSD 配上 RAID controller 的時候,常常會需要找問題... (而且很難找)

Intel 的評價很不錯:

Okay, bad start, we’ve actually had no issues with Intel. This seems to be common across other companies we’ve spoken to.

OCZ 倒了,被 Toshiba 收購,而且 S.M.A.R.T. 資訊很差,很難預測什麼時候會掛掉 (有助於提前替換):

However, they had poor SMART info (none) so predicting failures was hard.

HP 是個大黑盒:

Unfortunately, HP have proprietary RAID controllers, and they don’t support SMART. Or rather, they refuse to talk to non-HP drives using off the shelf technology, they have their own methods.

Samsung 的評價不錯,C/P 值很高,而且有 S.M.A.R.T.:

Samsung saved the day and picked up from OCZ with a ludicrously cheap 960GB offering, the 840 EVO. A consumer drive, so very limited warranty, but for the price (~$400-500) you got great IOPS and they were reliable. They had better SMART info, and seemed to play nicely with our hardware.

不過 BB6Q 版的韌體搞爆了效能,雖然最後修好了:「Samsung Releases Firmware Update to Fix the SSD 840 EVO Read Performance Bug」。

LiteOn 則是掛在 GC 上 (RAID 裡同時掛掉兩顆以上):

The SSDs were having extended garbage collection periods, exacerbated by a smaller amount of SSDs with higher IO, in RAID6. This caused the controller to kick the drive out of the array… and unfortunately due to the write levelling across the drives, at least two of them were garbage collecting at the same time, destroying the array integrity.

不過後來 Dell 與 LiteOn 分別就 RAID controller 與 SSD 本身都跳下去修正,最後還是解決了:

Dell and LiteOn together identified and fixed weaknesses in their RAID controller, the backplane and the SSD firmware.

算是經驗分享,在 SSD 硬碟成熟的過程中間必經的道路 XD

Nokia 以 166 億美金買下 Alcatel-Lucent

在「Nokia to acquire Alcatel-Lucent」這邊看到的報導,Nokia 以 166 億美金買下 Alcatel-Lucent

Nokia 的新聞稿在「NOKIA AND ALCATEL-LUCENT TO COMBINE TO CREATE AN INNOVATION LEADER IN NEXT GENERATION TECHNOLOGY AND SERVICES FOR AN IP CONNECTED WORLD」這邊。

其中 OSNews 被拿出來講的... 由於 Nokia 將手機部門賣給了微軟,所以 Nokia 其實是不能發展手機的 (應該有時間限制),但是 Alcatel-Lucent 現在手上有:

Nokia is not allowed to make smartphones for a while, but Alcatel-Lucent does make smartphones.

而這是從 HP 買來的 Palm... (所以 Nokia 又要玩什麼花招了 @_@)

單台 12TB RAM 的 server

翻資料的時候翻到「Scalability Techniques for Practical Synchronization Primitives」這篇文章,裡面剛好提到 HP 的高階伺服器:

HP 的 CS900 (HP ConvergedSystem) 看起來是搭著 SAP HANA 而出的硬體...


出自「http://ascii.jp/elem/000/000/946/946459/img.html

記錄起來,如果有需要用到的時候可以去問問 HP...

試用 HP Cloud (Compute 部分)

剛剛註冊了 HP Cloud 測試 Compute 的部分 (相對於 AWS 就是 EC2)。

先就公開的資訊來看,其實還不錯?最小的 instance 是 Extra Small,規格是 1GB RAM + 30GB Disk,收費 USD$0.04/hour,如果是 Small 則是 2GB RAM + 60GB Disk,另外在 Public Beta 期間是 50% off (半價)。

頻寬部分與 AWS 美國區的價錢相同,Inbound 不收費,Outbound 與 AWS 階梯式相同。

有提供網頁介面操作外,也有提供 CLI 可以用:「Unix Command Line Interface」,看起來是 Ruby... (沒裝起來測)

實際註冊完成後可以由從 zone 的名稱「az-1.region-a.geo-1」推敲出實際的機房位置在亞利桑那州 (實際開起來後也是如預期),HiNet 過去大約 140ms。

就 Public Beta 服務來看,要開 instance 時可以選擇的 OS image 也還算豐富:

CPU 部分,openssl speed 的數據是:「HP Cloud, openssl speed」,速度比起 EC2 的 m1.small 算是相當好的。

硬碟部分,iozone -a 的數據:「HP Cloud, iozone -a (over XFS)」,我是在跑 iozone 的時候用 dstat -at 觀察,寫入極限大約有 90MB/sec 到 100MB/sec,不過實際用 dd 測試就會發現應該是 host 有 cache,因為檔案很大的時候只有 20MB/sec 的寫入速度 XD

晚點再來測 Object Storage 與 CDN 的部分...