npm 裡的 redis 與 ioredis

前幾天在噗浪的偷偷說上看到有人提到 npmtrends 上的 redis (官方的) 與 ioredis:「https://www.plurk.com/p/p6wdc9」。

意外發現以下載量來看,ioredis 已經超越官方的 redis 了:

找了一下差異,看起來的確有些團隊在 loading 很高的情況下會考慮用 ioredis 取代 redis:「Migrating from Node Redis to Ioredis: a slightly bumpy but faster road」。

但沒有特別需求的話應該還是會用官方版本?

由 pnpm 做出來的效能比較:npm、yarn 以及 pnpm

找資料的時候看到 pnpm 有在做幾個常見的 javascript package manager 的比較:「Benchmarks of JavaScript Package Managers」。

裡面測了九個項目,其中前八個剛好就是 cache 的有無、lockfile 的有無以及 node_modules 的有無,這三者的組合剛好八個,最後一個是測 update 的速度。

pnpm 因為用了 hard link,我先跳過去,只先關注 npmyarn 的差異。

其中幾個比較重要的項目是 (由最重要開始列):

  • with cache, with lockfile, with node_modules:這個會是一般開發的情境。
  • update:套件有更新時的時間。
  • with lockfile:這個會是第一次 clone 下來後的環境,以及 CI 的環境。

在一般開發環境下,npm 比 yarn 快一些,這點讓我比較意外,我知道 npm 這幾年一直有在改善效率,但沒想到會在 benchmark 下反超,這個是以前 yarn 很大的宣傳點,現在已經不見了。

第二個是 update (upgrade) 的部份,這邊 yarn 比 npm 快一些。

第三個因為是 CI 環境,或是第一次 clone 時的環境,慢通常可以被接受,不要差太多就好,而這邊 npm 比 yarn 快一些。

但不管是哪個,差距都不像以前那麼大了,就效率上來說,官方的 npm 已經沒有太明顯的缺點,因為效率的差異而去選擇 yarn 的理由已經不存在了。

Netflix 單機 800Gbps 伺服器所使用的最佳化技巧

Hacker News 上看到 Netflix 的人丟出來的投影片,試著了解 Netflix 的 Open Connect Appliances 裡與 FreeBSD 相關的最佳化技巧對於效能的影響:「The “other” FreeBSD optimizations used by Netflix to serve video at 800Gb/s from a single server」。

看起來這邊的分析是先基於 400Gbps 的版本,可以跑到 375Gbps (53% CPU),接著在上面拔掉各種最佳化的設定,看看會掉多少流量。這邊可以參考先前在「Netflix 在單機服務 400Gbps 的影音流量」提到的資料。

投影片上的第一章是 sendfile 與 kTLS 相關的最佳化,這邊可以看出來都是重要的項目,隨便關掉一個就會掉很多 capacity:

  • Disable kTLS (and async sendfile) + nginx aio:40Gbps (100% CPU)
  • Disable kTLS (and async sendfile) + nginx thread pools:90Gbps (90% CPU)
  • Disable sendfile (but use kTLS):75Gbps (80% CPU)
  • Disable sendfile (but use NIC kTLS):95Gbps (80% CPU)
  • Enable Sendfile & kTLS, but disable ISA-L crypto:180Gbps (80% CPU)
  • Enable Sendfile & kTLS:240Gbps (80% CPU)

第二章是 virtual memory,UMA VM Page Cache 這邊看起來最明顯,SF_NOCACHE 也是個重要的項目:

  • Disable UMA VM Page Cache:60Gbps (95% CPU)
  • Disable VM Batch Queues:280Gbps (95% CPU)
  • Disable SF_NOCACHE:120Gbps (55% CPU)

另外第二章特別提到了一個之前沒有用到的 optimization,是把 arm64 上面的 4KB Pages 變成 16KB Pages,這帶動了些許的效能提昇,並且降低了 CPU 使用率:

345Gb/s @ 80% CPU -> 368Gb/s @ 66% CPU

第三章是 network stack,看起來 TSO 帶來的效益也是很高:

  • Disable TCP Large Receive Offload:330Gbps (65% CPU)
  • Disable RSS accelerated LRO:365Gbps (70% CPU)
  • TSO Disabled:180Gbps (85% CPU)
  • Disable TSO and LRO:170Gbps (85% CPU)

最後面則是有提到從 400Gbps 到 800Gbps 還多做了那些事情,最後是達到 731Gbps。

用的機器是 Dell PowerEdge R7525,這是一台 2U 的機器啊...

CPU Core 之間溝通的時間成本

Hacker News 上看到「Measuring CPU core-to-core latency (github.com/nviennot)」這篇,專案在「Measuring CPU core-to-core latency」這裡,看起來是個有趣的研究,測試許多不同 CPU 內,跨 core 之間溝通的時間花費。

依照專案的說明,測試的方式是利用 cache coherence 來來量測:

We measure the latency it takes for a CPU to send a message to another CPU via its cache coherence protocol.

By pinning two threads on two different CPU cores, we can get them to do a bunch of compare-exchange operation, and measure the latency.

裡面已經測了很多不同的 CPU,然後可以看到一些有趣的結果。

像是第一張圖片的「Intel Core i9-12900K @ 8+8 Cores (Alder Lake, 12th gen) 2021-Q4」這組,大家還蠻好奇 CPU #8 到底是怎麼一回事,跨 core 溝通的 latency 特別低,還特別找了 CPU 的 die 圖片看看:

另外一個是 AWS 上的 c6a.metal,機種是「AMD EPYC 7R13 @ 48 Cores (Milan, 3rd gen) 2021-Q1」,可以看到被分成了六個區塊:

接下來在 ARM 平台,在更多 CPU core 的 c7g.16xlarge 上,機種「AWS Graviton3 @ 64 Cores (Arm Neoverse, 3rd gen) 2021-Q4」,會看到更多不平均的現象:

早一點的 c6gd.metal 雖然也還是 ARM 的 64 cores 機種「AWS Graviton2 @ 64 Cores (Arm Neoverse, 2nd gen) 2020-Q1」,但可以看到很不一樣的 latency pattern:

大致上可以感覺到當 core 數愈多就會有很多技術上的瓶頸,導致不同 core 之間的溝通成本不一樣... 這個感覺跟當初學到 NUMA 的情況有點像。

Firefox 的 RCWN (Race Cache With Network)

前幾天 Hacker News 上看到「When network is faster than browser cache (2020) (simonhearne.com)」這則 2020 的文章,原文在「When Network is Faster than Cache」這邊,講 Firefox 在 2017 年導入了一個特別的設計,除了會在 cache 裡面抓資料以外,也會到網路上拉看看,有機會從網路上抓到的資料會比 cache 先得到,這個功能叫做 RCWN (Race Cache With Network):「Enable RCWN」。

開頭就先提到了有人回報 Firefox 上的 RCWN 似乎沒有明顯效果:「Tune RCWN racing parameters (and make them pref-able)」。

On my OSX box I'm seeing us race more than we probably need to:

Total network request count: 5574
Cache won count 938
Net won count 13

That's racing almost 16% of the time, but only winning 1.3% of the time. We should probably back off on racing a bit in this case, at least.

16% 的 request 決定 RCWN 兩邊打,但裡面只有 1.3% 是 network 比 cache 快。

不過作者決定試著再多找看看有沒有什麼方向可以確認,但測了很多項目都找不到哪個因素跟 cache retrieval time 有直接相關,反而在看看 Chromium 時發現 Chromium 是透過限制連線數量,降低 I/O 造成的問題:

It turns out that Chrome actively throttles requests, including those to cached resources, to reduce I/O contention. This generally improves performance, but will mean that pages with a large number of cached resources will see a slower retrieval time for each resource.

看起來就是個簡單粗暴的 workaround...

Cloudflare 推出了讓你買 cache 空間的 Cache Reserve

這幾天 Cloudflare 推出了一大包東西,其中一個是 Cache Reserve:「Introducing Cache Reserve: massively extending Cloudflare’s cache」。

一般的使用情境是依照 LRU 演算法在決定 Cloudflare 的 cache 滿的時候要排除誰:

We do eviction based on an algorithm called “least recently used” or LRU. This means that the least-requested content can be evicted from cache first to make space for more popular content when storage space is full.

Cache Reserve 就是自己買 cache 空間,他的作法是你付 R2 的空間費用:

Cache Reserve is a large, persistent data store that is implemented on top of R2.

這樣就可以完全依照 Cache-Control 這類 HTTP header 內的時間保存了,你就不用擔心會被 purge 掉,首先價錢包括了 R2 的部份:

The Cache Reserve Plan will mimic the low cost of R2. Storage will be $0.015 per GB per month and operations will be $0.36 per million reads, and $4.50 per million writes.

另外還有還沒公告的 Cache Reserve 的部份:

(Cache Reserve pricing page will be out soon)

對於很極致想要拼 hit rate 的使用者來說是個選擇就是了,另外可以想到直播相關的協定 (像是 HLS) 好像可以這樣搞來壓低對 origin server 的壓力?

Amazon EFS 效能提昇的一些討論

上一篇「Amazon EFS 的效能提昇」提到 Amazon EFS 的效能提昇,在 Hacker News 上看到 Amazon EFS 團隊的 PMT (Product-Manager-Technical) 出來回一些東西:「Amazon Elastic File System Update – Sub-Millisecond Read Latency (amazon.com)」,搜尋 geertj 應該就可以看到他回的東西了...

像是即使是 Jeff Barr 發表這篇文章,也還是經過 legal team 的同意才能發表:

(PMT on the EFS team).

Yes, the wordings are carefully formulated as they have to be signed off by the AWS legal team for obvious reasons. With that said, this update was driven by profiling real applications and addressing the most common operations, so the benefits are real. For example, a simple WordPress "hello world" is now about 2x as fast as before.

另外這次的效能提昇是透過 cache 層達成的:

I'm the PMT for this project in the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache. You can enable such a cache in multiple steps. First you deploy the software across the entire stack that supports the caching protocol but it's disabled by configuration. Then you turn it for the multiple components that are involved in the right order. Another thing that was hard to get right was to ensure that there are no performance regressions due to the consistency protocol.

然後在每個 AZ 都有 cache:

The caches are local to each AZ so you get the low latency in each AZ, the other details are different. Unfortunately I can't share additional details at this moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!

另外看起來主要就是 metadata cache 的幫助:

NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.

還是有很多細節數字不能透漏,但知道是透過 cache 達成的就已經可以大致上想像後面是怎麼弄出來的了...

把 Blog 丟到 CloudFront 上

先前在「AWS 流量相關的 Free Tier 增加不少...」這邊有提到一般性的流量從 1GB/month per region 升到 100GB/month,另外 CloudFront 則是大幅增加,從 50GB/month (只有註冊完的前 12 個月) 提升到 1TB/month (不限制 12 個月),另外 CloudFront 到 EC2 中間的流量是不計費的。

剛剛花了點功夫把 blog 從 Cloudflare 搬到 CloudFront 上,另外先對預設的 /* 調整成 no cache,然後針對 /wp-content/* 另外加上 cache 處理,跑一陣子看看有沒有問題再說...

目前比較明顯的改善就是 latency,從 HiNet 連到免費版的 Cloudflare 會導去美國,用 CloudFront 的話就會是台灣了:

另外一方面,這樣國際頻寬的部份就會走進 AWS 的骨幹,比起透過 HiNet 自己連到美國的 PoP 上,理論上應該是會快一些...

在 Minecraft 裡面幹出一台完整的電腦

Lobsters Daily 上看到有強者在 Minecraft 實做邏輯電路,幹出一台完整的電腦出來 (CPU 的部份應該是 turing-complete 了):

PCWorld 有報導:「This 8-bit processor built in Minecraft can run its own games」。

把影片裡的描述截圖出來:

連分支預測器都出現了:

記憶體就 Minecraft 來說也是超大的 256 bytes:

然後還做了 cache 層,這邊提到的是 data cache:

然後這邊是 instruction cache:

也因為已經相當的 powerful,很多經典遊戲都可以玩,像是俄羅斯方塊:

貪食蛇:

打磚塊:

Connect Four

來看 Intel + Varnish 的單機 500Gbps 的 PR 新聞稿

在「Varnish Software Achieves 500Gbps Throughput Per Server for UHD Video Content」這邊看到 PR 稿,由 IntelVarnish 合作,宣稱達到單機 500Gbps 的 throughput 了:

According to Varnish Software, the following were the outcomes of the test:

  • 509.7 Gbps live-linear throughput, using a dual-processor configuration
  • 487.2 Gbps video-on-demand throughput, using a dual-processor configuration

白皮書在「Delivering up to 500 Gbps Throughput for Next-Gen CDNs」這頁可以用個資交換下載,不過用搜尋引擎找一下可以發現 Intel 那邊有放出 PDF (但不確定兩邊給的是不是同一份):「Delivering up to 500 Gbps Throughput for Next-Gen CDNs」。

單 CPU 的伺服器是四個 100Gbps 界面接出來,雙 CPU 的伺服器是八個 (這邊 SUT 是 system under test 的縮寫):

These client systems were connected to the CDN servers using 100 GbE links through a switch; 4x100 GbE connections for the single-processor SUT, and 8x100 GbE for the dualprocessor SUT. Testing was done using Wrk, a widely recognized open-source HTTP(S) benchmarking tool.

不過如果實際看圖會發現伺服器是兩個 100Gbps (單 CPU) 與四個 100Gbps (雙 CPU),然後 wrk 也吃了兩個或是四個 100Gbps:

在白皮書最後面也有提到測試的配置,都是在 Ubuntu 20.04 上面跑,單 CPU 用的是兩張 Intel 的 100Gbps 網卡,雙 CPU 的用的是四張 Mellanox 的 100Gbps 網卡:

3rd generation Intel Xeon Scalable testing done by Intel in September 2021. Single processor SUT configuration was based on the Supermicro SMC 110P-WTR-TNR single socket server based on Intel® Xeon® Platinum 8380 processor (microcode: 0xd000280) with 40 cores operating at 2.3 GHz. The server featured 256 GB of RAM. Intel® Hyper-Threading Technology was enabled, as was Intel® Turbo Boost Technology 2.0. Platform controller hub was the Intel C620. NUMA balancing was enabled. BIOS version was 1.1. Network connectivity was provided by two 100 GbE Intel® Ethernet Network Adapters E810. 1.2 TB of boot storage was available via an Intel SSD. Application storage totaled 3.84TB per drive and was provided by 8 Intel P5510 SSDs. The operating system was Ubuntu Linux release 20.04 LTS with kernel 5.4.0-80 generic. Compiler GCC was version 9.3.0. The workload was wrk/master (April 17, 2019), and the version of Varnish was varnishplus-6.0.8r3. Openssl v1.1.1h was also used. All traffic from clients to SUT was encrypted via TLS.

3rd generation Intel Xeon Scalable testing done by Intel in September 2021. Dual processor SUT configuration was based on the Supermicro SMC 22OU-TNR dual socket server based on Intel® Xeon® Platinum 8380 processor (microcode: 0xd000280) with 40 cores operating at 2.3 GHz. The server featured 256 GB of RAM. Intel® Hyper-Threading Technology was enabled, as was Intel® Turbo Boost Technology 2.0. Platform controller hub was the Intel C620. NUMA balancing was enabled. BIOS version was 1.1. Network connectivity was provided by four 100 GbE Mellanox MCX516A-CDAT adapters. 1.2 TB of boot storage was available via an Intel SSD. Application storage totaled 3.84TB per drive and was provided by 12 Intel P5510 SSDs. The operating system was Ubuntu Linux release 20.04 LTS with kernel 5.4.0-80- generic. Compiler GCC was version 9.3.0. The workload was wrk/master (April 17, 2019), and the version of Varnish was varnish-plus6.0.8r3. Openssl v1.1.1h was also used. All traffic from clients to SUT was encrypted via TLS.

不過馬上就會滿頭問號,四張 100Gbps 是怎麼跑到 500Gbps 的頻寬...

這份 PR 馬上就讓人想到 Netflix 先前放出來的投影片 (先前有在「Netflix 在單機服務 400Gbps 的影音流量」這篇提到),在 Netflix 的投影片裡面有提到他們在 Intel 平台上面受限於記憶體的頻寬,整台機器只能跑到 230Gbps。

另外一種猜測是,如果 Intel 與 Varnish 宣稱的 500Gbps 是算 switch 上的總流量 (有這樣算的嗎,你是 Juniper 嗎...),那這邊的 500Gbps 換算回去差不多就是減半 (還很客氣的沒把 cache 沒中需要去 origin server 拉資料的流量扣掉),跟 Netflix 在 FreeBSD 上跑出來的結果差不多啊...

坐等反駁 XDDD