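NVIDIA Open-Sources Its Linux GPU Kernel Driver

NVIDIA has announced that it is open-sourcing its GPU kernel driver for Linux: "NVIDIA Releases Open-Source GPU Kernel Modules".

Judging from the descriptions, the push seems to have come from the datacenter side. In this open source release, support for Datacenter GPUs is production level, while support for GeForce and Workstation GPUs is flatly labeled alpha quality: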

Which GPUs are supported by Open GPU Kernel Modules?

Open kernel modules support all Ampere and Turing GPUs. Datacenter GPUs are supported for production, and support for GeForce and Workstation GPUs is alpha quality. Please refer to the Datacenter, NVIDIA RTX, and GeForce product tables for more details (Turing and above have compute capability of 7.5 or greater).

The user-mode driver, however, remains closed source:

Will the source for user-mode drivers such as CUDA be published?

These changes are for the kernel modules; while the user-mode components are untouched. So the user-mode will remain closed source and published with pre-built binaries in the driver and the CUDA toolkit.

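As for nouveau, it can dig some things out of the open source driver and reuse them, but can it dig out enough to reach the same performance level as the proprietary driver?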

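Redundancy Across Multiple Data Centers

100% uptime is a bit of an exaggeration, but this article gave me some ideas: "Multi-Home the Data Center for 100% uptime".

In this diagram, if the DB master in DC1 goes down, can DC2 only fall back to read-only mode? Some applications cannot even run in read-only mode...

That said, it reminded me of the approach of spreading Percona XtraDB Cluster across three data centers. Running only three nodes carries other risks, though; with a bigger budget, going to six nodes (two per data center) is also an option :o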
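One way to read the "other risks" of a three-node layout is through majority quorum. The sketch below is a simplified model, not Percona XtraDB Cluster's actual implementation: it assumes every node has equal weight and that, after each failure event, the survivors must be a strict majority of the previous membership. The function names `keeps_quorum` and `survives` are made up for illustration.

```python
def keeps_quorum(before: int, after: int) -> bool:
    """Strict majority: survivors must be more than half of the previous membership."""
    return 2 * after > before


def survives(losses: list[int], start: int) -> bool:
    """Apply successive failure events; each entry is the number of nodes lost at once.

    Returns False as soon as one event leaves the survivors without a majority.
    """
    members = start
    for lost in losses:
        remaining = members - lost
        if not keeps_quorum(members, remaining):
            return False
        members = remaining
    return True


# Three nodes, one per data center: losing one data center is fine,
# but one more node failure afterwards (1 of 2 left) loses quorum.
three_dc_only = survives([1], start=3)
three_dc_then_node = survives([1, 1], start=3)

# Six nodes, two per data center: losing a data center (2 nodes)
# and then one more node (3 of 4 left) still keeps a majority.
six_dc_only = survives([2], start=6)
six_dc_then_node = survives([2, 1], start=6)
```

Under these assumptions both layouts survive the loss of a single data center, but only the six-node layout tolerates one further node failure afterwards.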

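DigitalOcean Opens a Singapore Data Center...

Today DigitalOcean announced that its Singapore data center (SGP1) is in operation: "We're Excited To Announce Our Singapore Datacenter (SGP1)".

Anyone who wants to test latency or look at routing can use speedtest-sgp1.digitalocean.com, which DigitalOcean provides.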
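For a rough number without ping or traceroute, timing a TCP handshake against that host is one option. This is only a sketch: port 80 is an assumption (the post does not say which ports the test host answers on), and `tcp_connect_ms`, `summarize`, and `measure` are names made up for this example. A TCP connect takes about one round trip, so it approximates latency but is not the same thing as an ICMP ping.

```python
import socket
import statistics
import time

HOST = "speedtest-sgp1.digitalocean.com"
PORT = 80  # assumption: the test host accepts TCP connections on port 80


def tcp_connect_ms(ip: str, port: int, timeout: float = 3.0) -> float:
    """Time one TCP handshake to an already-resolved IP address, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((ip, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Reduce raw samples to min / median / max."""
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
    }


def measure(host: str = HOST, port: int = PORT, count: int = 5) -> dict[str, float]:
    """Resolve the host once (IPv4 only), so DNS time is excluded, then sample."""
    ip = socket.gethostbyname(host)
    return summarize([tcp_connect_ms(ip, port) for _ in range(count)])


# Usage (needs network access):
# print(measure())
```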

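From Chunghwa Telecom HiNet, both an FTTx (光世代) dynamic IP and a PPPoE static IP, as well as from the Chongxin data center in Sanchong, latency to speedtest-sgp1.digitalocean.com is around 90ms to 100ms.

From Taiwan Fixed Network's Neihu data center it is about 75ms.

The best figure I have seen so far is about 60ms from Far EasTone's data center: the ISP goes straight into NTT in Hong Kong, then over to NTT in Singapore, and finally into DigitalOcean.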

The comments also mention that peering is not yet complete and will keep being adjusted in the near term:

In regards to the latency that people may be experiencing in Singapore: We are sorry to hear that you are having latency issues at this time. Some of our peering is delayed and we will be improving generally connectivity around Asia in the coming weeks.

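From now on this is my go-to test point! XD

Twitter's New Data Center, and Some Numbers...

The post "The Great Migration, the Winter of 2011" on the Twitter Engineering Blog describes Twitter's planned data center move and includes some numbers...

Number of people currently maintaining the site: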

Today, the feed and care of Twitter requires more than 200 engineers to keep the site growing and running smoothly.

More than 1,000 machines (the post uses the word "thousands"):

Simultaneously, our operations engineers divided into new teams and built new processes and software to allow us to qualify, burn-in, deploy, tear-down and monitor the thousands of servers, routers, and switches that are required to build out and operate Twitter.

Volume of Tweet data:

Once we proved our replication strategy worked, we built out the full Twitter stack, and copied all 20TB of Tweets, from @jack’s first to @honeybadger’s latest Tweet to the second data center.

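The Most Common Cause of Data Center Outages: UPS

UPS systems turn out to be the most common cause of data center outages: "Survey: UPS Issues Are Top Cause of Outages".

This is a survey of US data centers. The sample is an analysis of 453 cases drawn from "problems attributed to the data center", and the causes include: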

  • UPS battery failure (65 percent)
  • Exceeding UPS capacity (53 percent)
  • Accidental emergency power off (EPO)/human error (51 percent)
  • UPS equipment failure (49 percent)

This must have been a multiple-choice question, right? Otherwise the total would be over 100%.
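The arithmetic behind that guess is quick to check. The sketch below only uses the four percentages listed above; the variable names are made up for the example, and the "average per case" figure assumes each percentage is the share of cases that named that cause.

```python
# Percentages as listed in the survey summary above.
causes = {
    "UPS battery failure": 65,
    "Exceeding UPS capacity": 53,
    "Accidental EPO / human error": 51,
    "UPS equipment failure": 49,
}

total_percent = sum(causes.values())

# If each case could name only one cause, the shares could not add up to more than 100.
must_be_multi_select = total_percent > 100

# Under the assumption above, this is the average number of the listed
# causes named per case.
avg_listed_causes_per_case = total_percent / 100

print(total_percent, must_be_multi_select, avg_listed_causes_per_case)
```

The four figures add up to 218%, so each case named a little over two of these causes on average, which only makes sense if multiple answers were allowed.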