AWS Lambda's cache architecture

An older article seen on Lobsters: 「[Cache Architecture for] Container Loading in AWS Lambda」; judging from the URL, the original post, 「Container Loading in AWS Lambda」, was published back in May of last year.

It's mainly about how to load containers so they start executing as quickly as possible. It first brings up the layer cache everyone commonly uses; on AWS Lambda they switched to a block level cache instead:

Most of the existing systems do this at the layer or file level, but we chose to do it at the block level.

Each chunk is 512KiB:

We unpack a snapshot (deterministically, which turns out to be tricky) into a single flat filesystem, then break that filesystem up into 512KiB chunks.
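As a rough mental model of that chunking step, here is a toy Python sketch of mine (not AWS's actual code; rootfs.img is a hypothetical flat filesystem image, and hashing each chunk is just one plausible way to key the cache):

# Toy sketch: break a flat filesystem image into 512 KiB chunks and derive a
# content hash per chunk to use as a cache key, so identical chunks across
# different images could be deduplicated in the cache tiers.
import hashlib

CHUNK_SIZE = 512 * 1024  # 512 KiB, matching the chunk size quoted above

def chunk_keys(path: str):
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield index, hashlib.sha256(chunk).hexdigest()
            index += 1

for idx, key in chunk_keys("rootfs.img"):
    print(idx, key)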

It then brings up the lazy load approach from 「Slacker: Fast Distribution with Lazy Docker Containers」:

Our analysis shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is read.

Slacker speeds up the median container development cycle by 20x and deployment cycle by 5x.

This technique is also used on AWS Lambda, implemented through FUSE:

In Lambda, we did this by taking advantage of the layer of abstraction that Firecracker provides us. Linux has a useful feature called FUSE provides an interface that allows writing filesystems in userspace (instead of kernel space, which is harder to work in).

Another thing AWS Lambda implements is tiered caching, split into three tiers: the worker's local cache (L1), a cache within the same AZ (L2), and the data in S3 (L3):

Despite our local on-worker (L1) cache being several orders of magnitude smaller than the AZ-level cache (L2) and that being much smaller than the full data set in S3 (L3), we still get 67% of chunks from the local cache, 32% from the AZ level, and less than 0.1% from S3.

And because the L3 cache is S3, they don't have to worry about durability in L1 and L2 (if a chunk is missing, just fall through to the next tier):

The whole set of chunks are stored in S3, meaning the cache doesn’t need to provide durability, just low latency.
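As a toy illustration of that fall-through lookup (entirely my own sketch, not Lambda's code; the dicts stand in for the on-worker and AZ caches, and fetch_from_s3 is a hypothetical function for the durable copy):

def get_chunk(key, l1: dict, l2: dict, fetch_from_s3):
    if key in l1:               # on-worker cache (L1)
        return l1[key]
    if key in l2:               # AZ-level cache (L2)
        value = l2[key]
        l1[key] = value         # promote on the way back
        return value
    value = fetch_from_s3(key)  # S3 (L3) always has the chunk, so L1/L2 need no durability
    l2[key] = value
    l1[key] = value
    return value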

They still use erasure coding, though, to keep the probability that each cache tier can serve a chunk from within its own tier as high as possible, which keeps peak latency down (otherwise node failures would make the 99.9%/99.95%/99.99% SLOs look bad?):

Think about what happens in a classic consistent hashed cache with 20 nodes when a node failure happens. Five percent of the data is lost. The hit rate drops to a maximum of 95%, which is a more than 5x increase in misses given that our normal hit rate is over 99%. At large scale machines fail all the time, and we don’t want big changes in behavior when that happens.

So we use a technique called erasure coding to completely avoid the impact. In erasure coding, we break each chunk up into M parts in a way that it can be recreated from any k. As long as M - k >= 1 we can survive the failure of any node with zero hit rate impact (because the other k nodes will pick up the slack).
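To see the 「recreate from any k of M」 idea concretely, here is a minimal sketch using a single XOR parity part (so M = k + 1); this is my own illustration, not whatever code Lambda actually runs:

# Toy erasure code: split a chunk into k data parts plus one XOR parity part,
# so the chunk can still be rebuilt when any single part (node) is missing.
def encode(chunk: bytes, k: int):
    part_len = -(-len(chunk) // k)  # ceil division
    parts = [chunk[i * part_len:(i + 1) * part_len].ljust(part_len, b"\0")
             for i in range(k)]
    parity = parts[0]
    for p in parts[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, p))
    return parts + [parity]         # M = k + 1 parts in total

def decode(parts, k: int, original_len: int):
    # Exactly one entry is None (the failed node); XOR of the survivors
    # recreates it, because the parity is the XOR of all data parts.
    missing = parts.index(None)
    survivors = [p for p in parts if p is not None]
    rebuilt = survivors[0]
    for p in survivors[1:]:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, p))
    data = parts[:k]
    if missing < k:
        data[missing] = rebuilt
    return b"".join(data)[:original_len]

blob = b"example 512KiB chunk contents"
parts = encode(blob, k=4)
parts[2] = None                     # simulate losing one cache node
assert decode(parts, k=4, original_len=len(blob)) == blob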

Presumably the original, simpler three-tier design couldn't hit the corresponding SLOs once benchmarked, so erasure coding was 「patched in」 to push the SLOs back up; you can really feel how the boss's requirements shape the architecture here...

That said, it's rare to see this many details shared publicly...

Redis changes its license and is no longer open source software

Redis announced it is dropping its open source license: 「Redis Adopts Dual Source-Available Licensing」; the corresponding git commit can be seen at 「Change license from BSD-3 to dual RSALv2+SSPLv1 (#13157)」.

Starting with Redis 7.4, Redis will be dual-licensed under the Redis Source Available License (RSALv2) and Server Side Public License (SSPLv1).

It's one of the hotter news items today, but the change was well within expectations: Redis already relicensed many of the components they developed themselves under SSPL back in 2018, so moving the core over isn't much of a surprise. What's next is seeing whether the community's forks gather more momentum, or whether the Redis company side does... Especially now that Redis has implemented so many data structures, does the Redis company still have a real chance to monetize this?

The more interesting part, though, is Microsoft... A day or two earlier, Microsoft released Garnet, a Redis-compatible implementation:

Garnet is a remote cache-store from Microsoft Research that offers strong performance (throughput and latency), scalability, storage, recovery, cluster sharding, key migration, and replication features. Garnet can work with existing Redis clients.
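Since the claim is compatibility with existing Redis clients, a quick smoke test could look like this (a sketch assuming a Garnet server is already listening locally on the default Redis port 6379, using the redis-py package):

# Minimal compatibility check: talk the Redis protocol to the local server.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
r.set("hello", "garnet")
print(r.get("hello"))  # expect "garnet" if the server speaks RESP like Redis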

Coincidence? The timing really is quite delicate...

CloudFront rolls out Embedded Points of Presence

Saw CloudFront's product announcement: 「Amazon CloudFront announces availability of Embedded Points of Presence」. AWS has rolled out an Embedded Points of Presence offering on CloudFront; judging from the name it's a more flexible kind of CDN PoP, but for the finer details you have to go read the FAQs...

From this passage it looks like these are AWS-owned appliances, deployed into physical facilities to serve traffic:

These embedded POPs are owned and operated by Amazon and deployed in the last mile of the ISP/MNO networks to avoid capacity bottlenecks in congested networks that connect end viewers to content sources, improving performance.

The more notable news is that there is no extra charge for this:

Q. Is there a separate charge for using embedded POPs?
No, there is no additional charge for using CloudFront embedded POPs.

Also, the service is opt-in, but you don't need to set up an extra distribution, and CloudFront will automatically mix and match PoPs for distributions that have opted in:

Embedded POPs are an opt-in capability intended for the delivery of large scale cacheable traffic. Please contact your AWS sales representative to evaluate if embedded POPs are suitable for your workloads.

No, you do not need to create a new distribution specifically for embedded POPs. If your workload is eligible, CloudFront will enable embedded POPs for your existing distribution upon request.

You don't have to choose between CloudFront embedded POPs or CloudFront POPs for content delivery. Once your CloudFront distribution is enabled for embedded POPs, CloudFront's routing system dynamically utilizes both CloudFront POPs and embedded POPs to deliver content, ensuring optimal performance for end users.

The next section, 「Compliance」, mentions that embedded POPs are not covered by PCI DSS, HIPAA, or SOC compliance, which also explains the earlier recommendation about what to put on them: it steers clear of sensitive services and focuses on content where everyone sees the same thing:

Embedded POPs are custom built to deliver large scale live-streaming events, video-on-demand (VOD), and game downloads.

It looks a bit like Netflix's Open Connect or Google's GGC, letting ISPs or MNOs host a cache service to cut down the traffic they have to pull over external links.

This probably comes back to the old question: ISPs/MNOs of course want CloudFront to pay to put machines in, not to apply and host them at their own expense; that's a business problem rather than a technical one...

How Firefox and Chrome handle Intermediate CAs differently

Saw 「The recording of my "Browsers biggest TLS Mistake" lightning talk at #37C3」 on the Fediverse. It's a lightning talk from 37C3, only five minutes long, and the video can be watched at 「Browsers biggest TLS mistake」.

A properly configured HTTPS server sends the Intermediate CA certificate along with its own TLS certificate:

When the server side is misconfigured, it usually sends only its own TLS certificate:

Firefox handles this case: the browser itself preloads all Intermediate CAs to avoid the problem (which of course requires keeping the software updated). I mentioned this three years ago in 「Firefox 試著透過預載 Intermediate CA 降低連線錯誤發生的機率?」, and it's exactly what this slide describes:

Chrome instead looks through its current cache to see whether it has already seen a suitable Intermediate CA somewhere else that can complete the chain:

That seems to explain why, when I ran into similar problems before, I had to go into chrome:// and clear things out on Chrome before the problem would show up again...
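A quick way to check whether a server is in the 「only sends its own certificate」 situation is to count what it actually serves during the handshake; a small sketch of mine (not from the talk) that shells out to openssl s_client:

# Count the certificates a server presents; a properly configured server
# usually sends its leaf plus at least one intermediate, so seeing only 1
# is a hint that the intermediate is missing.
import subprocess

def count_served_certs(host: str, port: int = 443) -> int:
    out = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}",
         "-servername", host, "-showcerts"],
        input="", capture_output=True, text=True, timeout=15,
    ).stdout
    return out.count("-----BEGIN CERTIFICATE-----")

print(count_served_certs("example.com"))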

arXiv is now behind Fastly's CDN

Saw arXiv announce that it is now on Fastly's CDN: 「Faster arXiv with Fastly」.

Looking up arxiv.org's DNS records, you can see it currently looks like this:

;; ANSWER SECTION:
arxiv.org.              10      IN      A       151.101.131.42
arxiv.org.              10      IN      A       151.101.3.42
arxiv.org.              10      IN      A       151.101.67.42
arxiv.org.              10      IN      A       151.101.195.42

mtr 測試,看起來 HiNet 過去的 routing 還是進到新加坡。

static.arxiv.org, however, is on CloudFront:

;; ANSWER SECTION:
static.arxiv.org.       3600    IN      CNAME   daa2ks08y5ls.cloudfront.net.
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.100
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.29
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.88
daa2ks08y5ls.cloudfront.net. 60 IN      A       13.35.35.127

According to the official announcement the migration is still in progress; it's just not clear whether the parts already on CloudFront (like static.arxiv.org above) will be moved over as well:

That includes our home page, listings, abstracts, and papers — both PDF and HTML (more on that soon).

FSRM (Fast Short REP MOV) performance problems on AMD Zen 3 and Zen 4

A piece discussed on Hacker News a few days ago: 「Rust std fs slower than Python? No, it's hardware (xuanwo.io)」; the original article is 「Rust std fs slower than Python!? No, it's hardware!」.

It started with the author receiving a report that a piece of Rust code (read_file_with_opendal() in the article, which reads through OpenDAL) was slower than the Python version (read_file_with_normal() in the article, which simply opens the file with Python's open() and reads it).
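The Python side is essentially just open() plus read(); a rough guess at what read_file_with_normal() looks like (the function name comes from the article, the body and the path are my assumptions):

# Plain Python baseline: open the file and read it in one go.
def read_file_with_normal(path: str = "/tmp/testfile") -> bytes:
    with open(path, "rb") as f:
        return f.read()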

To cut to the chase, the problem turned out to be that on Zen 3 (the desktop 5000-series CPUs) and Zen 4 (the desktop 7000-series CPUs), the REP MOV family of instructions has performance problems in certain situations (related to the offset).

FSRM-type instructions are used in places like memcpy() and memmove(), which is very commonly exercised functionality; the issue tracked down here is that glibc's use of them caused the performance anomaly.

You can also find them being used in the Linux kernel: 「Linux 5.6 To Make Use Of Intel Ice Lake's Fast Short REP MOV For Faster memmove()」, so there should be follow-up discussions about improvements there as well...

The Ubuntu-side issue ticket is at 「Terrible memcpy performance on Zen 3 when using rep movsb」, and upstream glibc has a corresponding tracker: 「30995 – Zen 4: sub-optimal memcpy on very large copies」.

According to what the author heard privately, AMD may not be able to ship a CPU microcode patch that would fix the problem directly, because of patch space limits:

However, unverified sources suggest that a fix via amd-ucode is unlikely (at least for Zen 3) due to limited patch space. If you have more information on this matter, please reach out to me.

So for now the more feasible approach is to add workarounds in glibc where FSRM is used, falling back to the pre-FSRM code paths on Zen 3 and Zen 4:

Our only hope is to address this issue in glibc by disabling FSRM as necessary. Progress has been made on the glibc front: x86: Improve ERMS usage on Zen3. Stay tuned for updates.

Along the way, the investigation ran into different situations that called for different profiling tools, so it's well worth reading through once to get a feel for them:

The timeit at the beginning is Python's simple, built-in benchmarking library:
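For reference, a minimal timeit run for this kind of comparison could look like this (my own example, not the article's exact benchmark):

# Time 100 plain-Python reads of a test file with the standard library's timeit.
import timeit

print(timeit.timeit("open('/tmp/testfile', 'rb').read()", number=100))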

The next comparison was produced with the command line tool hyperfine (you give it two commands and let it run them); I checked and it's packaged in Ubuntu's official apt repository (22.04+):

Then strace is used to chase the problem; it's a classic tool for seeing when syscalls are made:

Later on perf shows up, which can look at lower-level information such as what the CPU caches are doing:

The 「hotspot ASM」 mentioned after that should still be perf output, though I'm not entirely sure... You can see function-level analysis in 「perf Examples」:

And the output in the article goes all the way down to the assembly level:

That's about it...

Problems upgrading to WordPress 6.3

WordPress 6.3 is out: 「WordPress 6.3 “Lionel”」. I clicked upgrade without thinking much and it broke: the admin interface was down while the site itself could still be browsed. First step was digging out the PHP error message:

2023/08/09 04:52:32 [error] 845702#845702: *1350433 FastCGI sent in stderr: "PHP message: PHP Fatal error:  Uncaught ArgumentCountError: Too few arguments to function W3TC\Util_Environment::is_dbcluster(), 0 passed in /srv/blog.gslin.org/public/wp-content/db.php on line 56 and exactly 1 expected in /srv/blog.gslin.org/public/wp-content/plugins/w3-total-cache/Util_Environment.php:176

Following that message I found this fairly recent discussion: 「Fatal error in Util_Environment」, which looked related to W3 Total Cache.

(By the way, the error message was still quite new: Kagi couldn't find it, and I eventually found it via DuckDuckGo.)

The fix is a bit crude: first go into wp-content/plugins and knock W3 Total Cache out:

sudo chmod 000 w3-total-cache

After running the upgrade, an error message appears saying W3 Total Cache isn't enabled; at that point turn it back on:

sudo chmod 755 w3-total-cache

Then clear all the caches and everything goes back to normal.

KeyDB: improving Redis performance with multithreading

Saw a multithreaded Redis fork on Hacker News: 「KeyDB – A Multithreaded Fork of Redis (keydb.dev)」; the official site is at 「KeyDB - The Faster Redis Alternative」.

This post, though, is mainly to record the pitfalls mentioned on Hacker News, so it'll be easier to find them the next time I look this up.

Comment 36022425 is from someone who jumped in, found it didn't really work out, and ended up implementing the features they needed on the application side while keeping stock Redis as the backend:

To counter what the other active business said, we tried using KeyDB for about 6 months and every fear you concern you stated came true. Numerous outages, obscure bugs, etc. Its not that the devs aren’t good, its just a complex problem and they went wide with a variety of enhancements. We changed client architecture to work better with tradition Redis. Combined with with recent Redis updates, its rock solid and back to being an after-thought rather than a pain point. Its only worth the high price if it solves problems without creating worse ones. I wish those guys luck but I wont try it again anytime soon.

* its been around 2 years since our last use as a paying customer. YMMV.

Also, searching the project for 「is:open is:issue label:"Priority 1"」 gives results that don't look great; 36021108 mentions this problem:

Filed July, eventually marked priority 1 in early December, not a single comment or signs of fix on it since. That doesn't look good at all.

And 36020184 mentions that after Snap bought them, nobody really seems to be looking after the open source project anymore:

I think I'll stay far away from this thing anyway. Numerous show-stopper bug reports open and there hasn't been a substantial commit on the main branch in at least a few weeks, and possibly months. I'll be surprised if Snap is actually paying anybody to work on this.

Speeding up llama.cpp's model loading

Saw 「Llama.cpp 30B runs with only 6GB of RAM now (github.com/ggerganov)」 on Hacker News; the original pull request is at 「Make loading weights 10-100x faster #613」.

The PR's author, Justine Tunney, mentions in the PR that she changed the model file format so the weights can be mmap()ed, drastically reducing the up-front read time (loading becomes lazy), and also letting the system use the page cache, avoiding double buffering:

This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.
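A minimal sketch of the idea on the reading side (my own illustration in Python rather than llama.cpp's code; weights.bin is a placeholder path, and it assumes a Linux-style mmap):

# Map the weights file instead of read()ing it: nothing is copied up front,
# and only the pages actually touched get faulted in from the page cache,
# which is shared with any other process mapping the same file.
import mmap

with open("weights.bin", "rb") as f:
    m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
    header = m[:64]                              # faults in only the first page(s)
    tensor_slice = m[1 << 20:(1 << 20) + 4096]   # faults in just this region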

This reminds me of the database world, where PostgreSQL also makes use of mmap(); it's a somewhat similar idea.

In a comment there, Justine Tunney also mentions an unexpected observation: the amount of model data actually touched during computation is surprisingly small. Testing with a simple prompt, she found that of the 20GB 30B model file, only about 1.6GB was actually used on her Intel machine:

If I run 30B on my Intel machine:

[...]

As we can see, 400k page faults happen, which means only 1.6 gigabytes ((411522 * 4096) / (1024 * 1024)) of the 20 gigabyte weights file actually needed to be used.

She still wonders whether this is a bug in her change, but at the moment she doesn't think so and can't spot one:

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, however I don't think it's likely. I was pretty careful in writing this change, to compare the deterministic output of the LLaMA model, before and after the Git commit occurred. I haven't however actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us know why, from more a data science (rather than systems engineering perspective) why 30B is sparse.

If it's not a bug, this is actually a rather interesting signal: it suggests these models could be slimmed down further?

redis vs. ioredis on npm

A few days ago, in an anonymous post on Plurk, I saw someone mention the npm trends comparison of redis (the official client) and ioredis: 「https://www.plurk.com/p/p6wdc9」.

Unexpectedly, by download count ioredis has already overtaken the official redis package:

Looking into the differences, it does seem that some teams under heavy load consider replacing redis with ioredis: 「Migrating from Node Redis to Ioredis: a slightly bumpy but faster road」.

But without specific needs, I'd guess most people will stick with the official one?