An older article I saw on Lobsters: "[Cache Architecture for] Container Loading in AWS Lambda"; judging from the URL, the original was published back in May of last year: "Container Loading in AWS Lambda".
It's mainly about how to load a container so it can start executing as quickly as possible. It first mentions the layer cache everyone commonly uses, and notes that AWS Lambda switched to a block-level cache instead:
Most of the existing systems do this at the layer or file level, but we chose to do it at the block level.
Each chunk is 512KiB:
We unpack a snapshot (deterministically, which turns out to be tricky) into a single flat filesystem, then break that filesystem up into 512KiB chunks.
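Just to make the chunking step concrete, here's a minimal sketch in Python, assuming the snapshot has already been unpacked into a single flat image file; the 512KiB size comes from the quote, while the SHA-256 content addressing and the manifest format are my own illustrative assumptions, not Lambda's actual layout:

```python
# Split a flattened filesystem image into fixed-size 512 KiB chunks and
# address each chunk by its content hash, so identical chunks can be shared
# (deduplicated) across container images. The hashing/manifest format here
# is an illustrative assumption, not AWS's actual format.
import hashlib

CHUNK_SIZE = 512 * 1024  # 512 KiB, as in the quote above

def chunk_image(image_path):
    """Yield (index, sha256_hex, data) for each 512 KiB chunk of the image."""
    with open(image_path, "rb") as f:
        index = 0
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            yield index, hashlib.sha256(data).hexdigest(), data
            index += 1

# Example: build a manifest mapping chunk index -> content hash.
# manifest = {i: digest for i, digest, _ in chunk_image("rootfs.img")}
```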
Next it brings up the lazy-load approach, citing "Slacker: Fast Distribution with Lazy Docker Containers":
Our analysis shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is read.
Slacker speeds up the median container development cycle by 20x and deployment cycle by 5x.
This technique is also used in AWS Lambda, and it's implemented through FUSE:
In Lambda, we did this by taking advantage of the layer of abstraction that Firecracker provides us. Linux has a useful feature called FUSE provides an interface that allows writing filesystems in userspace (instead of kernel space, which is harder to work in).
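There's no code in the original article, but the idea behind a FUSE-based lazy loader can be sketched without the actual FUSE plumbing: a read() for some byte range only needs to materialize the 512KiB chunks covering that range, fetching each one from the cache hierarchy on first touch. Everything below (the LazyImage class and the fetch_chunk callback) is a hypothetical sketch of mine, not Lambda's implementation:

```python
# Sketch of lazy, on-demand chunk loading: a read() for an arbitrary byte
# range only fetches the 512 KiB chunks that cover that range, instead of
# downloading the whole image up front. The fetch_chunk callback standing
# in for the real cache/S3 backend is a hypothetical placeholder.
CHUNK_SIZE = 512 * 1024

class LazyImage:
    def __init__(self, total_size, fetch_chunk):
        self.total_size = total_size
        self.fetch_chunk = fetch_chunk      # callable: chunk_index -> bytes
        self.loaded = {}                    # chunk_index -> bytes

    def read(self, offset, size):
        """Return `size` bytes starting at `offset`, loading chunks lazily."""
        end = min(offset + size, self.total_size)
        out = bytearray()
        for idx in range(offset // CHUNK_SIZE, (end - 1) // CHUNK_SIZE + 1):
            if idx not in self.loaded:
                self.loaded[idx] = self.fetch_chunk(idx)   # first touch only
            chunk = self.loaded[idx]
            lo = max(offset, idx * CHUNK_SIZE) - idx * CHUNK_SIZE
            hi = min(end, (idx + 1) * CHUNK_SIZE) - idx * CHUNK_SIZE
            out += chunk[lo:hi]
        return bytes(out)
```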
Another thing AWS Lambda implements is tiered caching, with three tiers: the worker's local cache (L1), a cache within the same AZ (L2), and the data on S3 (L3):
Despite our local on-worker (L1) cache being several orders of magnitude smaller than the AZ-level cache (L2) and that being much smaller than the full data set in S3 (L3), we still get 67% of chunks from the local cache, 32% from the AZ level, and less than 0.1% from S3.
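The read path for the three tiers is easy to picture as pseudocode: try L1, then L2, then S3, and backfill the faster tiers on the way back. The sketch below is my own, with hypothetical get/put interfaces rather than anything from the article:

```python
# Sketch of the tiered read path: try the on-worker cache (L1), then the
# AZ-level cache (L2), and fall back to S3 (L3), backfilling the faster
# tiers on the way out. The get/put interfaces are hypothetical.
class TieredChunkCache:
    def __init__(self, l1, l2, s3):
        self.l1, self.l2, self.s3 = l1, l2, s3   # each: get(key) / put(key, value)

    def get_chunk(self, key):
        chunk = self.l1.get(key)                 # ~67% of chunks, per the quote
        if chunk is not None:
            return chunk
        chunk = self.l2.get(key)                 # ~32% of chunks
        if chunk is None:
            chunk = self.s3.get(key)             # <0.1%, but always durable
            self.l2.put(key, chunk)
        self.l1.put(key, chunk)
        return chunk
```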
And because the L3 cache is S3, they don't have to worry about durability at L1 and L2 (if something goes missing, just fall through to the next tier):
The whole set of chunks are stored in S3, meaning the cache doesn’t need to provide durability, just low latency.
But they still use erasure coding, to keep the probability that each cache tier can serve data from within its own tier as high as possible, which in turn keeps peak latency down (otherwise the 99.9%/99.95%/99.99% SLOs would look bad?):
Think about what happens in a classic consistent hashed cache with 20 nodes when a node failure happens. Five percent of the data is lost. The hit rate drops to a maximum of 95%, which is a more than 5x increase in misses given that our normal hit rate is over 99%. At large scale machines fail all the time, and we don’t want big changes in behavior when that happens.
So we use a technique called erasure coding to completely avoid the impact. In erasure coding, we break each chunk up into M parts in a way that it can be recreated from any k. As long as M - k >= 1 we can survive the failure of any node with zero hit rate impact (because the other k nodes will pick up the slack).
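The simplest instance of this is the M = k + 1 case, where the extra part is just an XOR parity shard, so losing any single part still leaves k parts to rebuild from. The article doesn't say which code Lambda actually uses (Reed-Solomon-style codes are the usual choice when M - k > 1), so the Python below is only an illustration of the idea:

```python
# Minimal illustration of the M = k + 1 special case of erasure coding:
# split a chunk into k equal data shards plus one XOR parity shard, so any
# k of the k + 1 parts are enough to rebuild the chunk. This is only a
# sketch of the idea, not Lambda's actual scheme.

def encode(chunk: bytes, k: int):
    """Return k data shards plus one parity shard (pads to k * shard_len)."""
    shard_len = -(-len(chunk) // k)              # ceiling division
    padded = chunk.ljust(k * shard_len, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = bytes(shard_len)                    # shard_len zero bytes
    for s in shards:
        parity = bytes(x ^ y for x, y in zip(parity, s))
    return shards + [parity]

def decode(parts, original_len):
    """Rebuild the chunk from the k data shards plus the parity shard,
    where at most one entry of `parts` is None (the lost part)."""
    k = len(parts) - 1
    if None in parts:
        missing = parts.index(None)
        shard_len = len(next(p for p in parts if p is not None))
        rebuilt = bytes(shard_len)
        for i, p in enumerate(parts):
            if i != missing:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, p))
        parts = parts[:missing] + [rebuilt] + parts[missing + 1:]
    return b"".join(parts[:k])[:original_len]

# Example: losing any one shard still reconstructs the original chunk.
# shards = encode(chunk, k=4)
# assert decode(shards[:2] + [None] + shards[3:], len(chunk)) == chunk
```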
My guess is that the originally simpler three-tier architecture turned out, after benchmarking, to be unable to hit the corresponding SLOs, so erasure coding was "patched in" to push the SLOs up. From this you can get a feel for how the boss's requirements end up shaping architecture design...
That said, it's rare to see this kind of detail being shared...