The Improvements Behind Amazon S3 Becoming Strongly Consistent

I saw the Hacker News discussion "Diving Deep on S3 Consistency (allthingsdistributed.com)" and realized I should write this up. In the original post, "Diving Deep on S3 Consistency", Amazon CTO Werner Vogels describes at some length how Amazon S3 went from eventually consistent to strongly consistent. I also wrote a post when Amazon S3 made the announcement: "Amazon S3 Is Now Strong Read-After-Write Consistent...".

Amazon S3 was eventually consistent because of the cache design in its metadata subsystem:

Per-object metadata is stored within a discrete S3 subsystem. This system is on the data path for GET, PUT, and DELETE requests, and is responsible for handling LIST and HEAD requests. At the core of this system is a persistence tier that stores metadata. Our persistence tier uses a caching technology that is designed to be highly resilient. S3 requests should still succeed even if infrastructure supporting the cache becomes impaired. This meant that, on rare occasions, writes might flow through one part of cache infrastructure while reads end up querying another. This was the primary source of S3’s eventual consistency.
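
To make the quoted failure mode concrete, here is a minimal toy sketch of a write going through one cache node while a read is served by another. The class names (PersistenceTier, CacheNode) and the object key are hypothetical stand-ins for illustration, not S3's actual design.

```python
class PersistenceTier:
    """Hypothetical durable metadata store (a stand-in, not S3's real design)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class CacheNode:
    """One cache node that remembers whatever it last read or wrote."""

    def __init__(self, store):
        self.store = store
        self.entries = {}

    def write(self, key, value):
        # Writes go to the persistence tier and update only this node's cache.
        self.store.put(key, value)
        self.entries[key] = value

    def read(self, key):
        # A cached entry is served without asking the persistence tier.
        if key not in self.entries:
            self.entries[key] = self.store.get(key)
        return self.entries[key]


store = PersistenceTier()
node_a, node_b = CacheNode(store), CacheNode(store)

node_a.write("photo.jpg", "v1")
first_read = node_b.read("photo.jpg")   # node_b fetches and caches "v1"
node_a.write("photo.jpg", "v2")         # the overwrite flows through node_a only
stale_read = node_b.read("photo.jpg")   # node_b still serves its cached copy
fresh_read = node_a.read("photo.jpg")   # node_a sees its own write
```

In this toy model the persistence tier already holds "v2", yet the read through node_b returns "v1" until something refreshes or invalidates its entry.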

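The most direct way to get rid of eventual consistency would be to remove the cache, but that would hurt performance too much. The design had to keep the cache, so the idea was to use a separate mechanism to make sure the state held in the cache is correct: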

One early consideration for delivering strong consistency was to bypass our caching infrastructure and send requests directly to the persistence layer. But this wouldn’t meet our bar for no tradeoffs on performance. We needed to keep the cache. To keep values properly synchronized across cores, CPUs implement cache coherence protocols. And that’s what we needed here: a cache coherence protocol for our metadata caches that allowed strong consistency for all requests.

Next came the logic that gives the operations on each S3 object a well-defined order (serializability):

We had introduced new replication logic into our persistence tier that acts as a building block for our at-least-once event notification delivery system and our Replication Time Control feature. This new replication logic allows us to reason about the “order of operations” per-object in S3. This is the core piece of our cache coherency protocol.
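
The quoted passage does not spell out the protocol, so the following is only a sketch of the general idea under stated assumptions: the persistence tier assigns a per-object sequence number to every write, and a cache node serves its cached copy only if that copy is not older than the latest sequence number. VersionedStore, CoherentCacheNode and latest_seq are hypothetical names, and the freshness check here simply asks the store, which a real system would need to make much cheaper than a full read.

```python
class VersionedStore:
    """Hypothetical persistence tier that orders operations per object."""

    def __init__(self):
        self.data = {}  # key -> (sequence number, value)

    def put(self, key, value):
        # Each write to a key gets the next sequence number for that key.
        seq = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (seq, value)
        return seq

    def latest_seq(self, key):
        return self.data.get(key, (0, None))[0]

    def get(self, key):
        return self.data.get(key, (0, None))


class CoherentCacheNode:
    """Cache node that refuses to serve an entry older than the latest write."""

    def __init__(self, store):
        self.store = store
        self.entries = {}  # key -> (sequence number, value)

    def write(self, key, value):
        seq = self.store.put(key, value)
        self.entries[key] = (seq, value)

    def read(self, key):
        cached = self.entries.get(key)
        # Refresh when nothing is cached or the cached copy is behind.
        if cached is None or cached[0] < self.store.latest_seq(key):
            cached = self.store.get(key)
            self.entries[key] = cached
        return cached[1]


store = VersionedStore()
node_a, node_b = CoherentCacheNode(store), CoherentCacheNode(store)

node_a.write("photo.jpg", "v1")
first_read = node_b.read("photo.jpg")
node_a.write("photo.jpg", "v2")
second_read = node_b.read("photo.jpg")  # node_b notices its copy is behind
```

With the per-object ordering in place, a read through either node returns the most recent write in this toy model, while reads of unchanged objects can still be answered from the cache.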

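After that, the cache coherence mechanism itself had to be highly available, and finally the correctness of the implementation had to be verified, which took more effort than implementing the protocol itself: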

These verification techniques were a lot of work. They were more work, in fact, than the actual implementation itself. But we put this rigor into the design and implementation of S3’s strong consistency because that is what our customers need.

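Amazon S3 was one of AWS's flagship services at launch, and the "Amazon's Dynamo" paper from that period had a dramatic influence on the whole industry (although the paper uses Amazon's shopping cart as its example). This post works as an update to the techniques of that era, showing that what used to be eventually consistent can be pushed to strongly consistent.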

Akamai Also Launches a Key-Value Service: EdgeKV

I haven't covered Akamai's architecture before, so first: Akamai's serverless platform at the edge is EdgeWorkers, which runs JavaScript:

EdgeWorkers lets developers just code — integrating into existing CI/CD workflows and enabling multiple teams to work in parallel using JavaScript. EdgeWorkers eliminates the hassle of managing compute resources and building for scale.

What's being launched this time is EdgeKV, which is still in beta: "Serverless Storage at the Edge (EdgeKV Beta)".

As the name suggests, it is a key-value store. It gives up the C in the CAP theorem and goes with eventual consistency:

EdgeKV uses what is known in distributing computing as an eventual consistency model to perform writes and updates. This model achieves high availability with low read latency by propagating data writes globally. The period of time it takes the system to distribute data globally is called the “inconsistency window”.

Its neighbor Cloudflare Workers KV is also eventually consistent (from "How KV works"):

KV achieves this performance by being eventually-consistent. Changes are immediately visible in the edge location at which they're made, but may take up to 60 seconds to propagate to all other edge locations.
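
The "inconsistency window" can be illustrated with a small simulation. This is a toy model that follows the behavior described in the Cloudflare quote (a write is visible at its origin immediately and elsewhere only after a delay); the class name EventuallyConsistentKV, the location names, and the logical clock measured in ticks are all assumptions made for illustration, not either vendor's API.

```python
class EventuallyConsistentKV:
    """Toy model: writes are visible at the origin at once, elsewhere after `delay` ticks."""

    def __init__(self, locations, delay):
        self.delay = delay
        self.now = 0  # logical clock, in ticks
        self.replicas = {loc: {} for loc in locations}
        self.pending = []  # (visible_at, location, key, value)

    def put(self, origin, key, value):
        self.replicas[origin][key] = value
        for loc in self.replicas:
            if loc != origin:
                self.pending.append((self.now + self.delay, loc, key, value))

    def get(self, location, key):
        return self.replicas[location].get(key)

    def advance(self, ticks):
        # Move the clock forward and apply every write whose delay has elapsed.
        self.now += ticks
        still_pending = []
        for visible_at, loc, key, value in self.pending:
            if visible_at <= self.now:
                self.replicas[loc][key] = value
            else:
                still_pending.append((visible_at, loc, key, value))
        self.pending = still_pending


kv = EventuallyConsistentKV(["tokyo", "paris"], delay=60)
kv.put("tokyo", "feature_flag", "on")

at_origin = kv.get("tokyo", "feature_flag")       # visible immediately
inside_window = kv.get("paris", "feature_flag")   # not propagated yet
kv.advance(60)
after_window = kv.get("paris", "feature_flag")    # propagated
```

Any reader that hits a different location during those 60 ticks sees the old state, which is the trade made in exchange for low read latency.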

It looks like Akamai is filling a gap in its product line relative to its competitors...

Amazon DynamoDB Accelerator (DAX)

A new component for DynamoDB that handles caching on the service side: "Amazon DynamoDB Accelerator (DAX) – In-Memory Caching for Read-Intensive Workloads".

DAX is compatible with the existing DynamoDB API:

DAX is a fully managed caching service that sits (logically) in front of your DynamoDB tables. It operates in write-through mode, and is API-compatible with DynamoDB.

Because of the cache, it is an eventually consistent architecture:

Responses are returned from the cache in microseconds, making DAX a great fit for eventually-consistent read-intensive workloads.
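
As a rough illustration of what "write-through" means and why reads from such a cache are only eventually consistent, here is a minimal in-memory sketch. Table and WriteThroughCache are hypothetical stand-ins with simplified method signatures; they are not the DynamoDB or DAX client API.

```python
class Table:
    """Hypothetical stand-in for a DynamoDB table; counts reads that reach it."""

    def __init__(self):
        self.items = {}
        self.reads = 0

    def put_item(self, key, item):
        self.items[key] = item

    def get_item(self, key):
        self.reads += 1
        return self.items.get(key)


class WriteThroughCache:
    """Cache in front of the table: writes hit the table first, then the cache."""

    def __init__(self, table):
        self.table = table
        self.cache = {}

    def put_item(self, key, item):
        self.table.put_item(key, item)
        self.cache[key] = item

    def get_item(self, key):
        if key in self.cache:
            return self.cache[key]  # served from memory, table untouched
        item = self.table.get_item(key)
        if item is not None:
            self.cache[key] = item
        return item


table = Table()
cache = WriteThroughCache(table)

cache.put_item("user#1", {"name": "alice"})
cached_read = cache.get_item("user#1")
table_reads_after_hit = table.reads          # the hit never reached the table

table.put_item("user#1", {"name": "bob"})    # a write that bypasses the cache
stale_read = cache.get_item("user#1")        # the cache still holds the old item
```

Writes that go through the cache keep it up to date, but anything that changes the table directly leaves the cached item stale until it is refreshed, which is why this pattern suits read-heavy workloads that tolerate eventual consistency.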

It is built from r3-family instances and is limited to ten nodes (which raises a big question mark):

Each DAX cluster can contain 1 to 10 nodes; you can add nodes in order to increase overall read throughput. The cache size (also known as the working set) is based on the node size (dax.r3.large to dax.r3.8xlarge) that you choose when you create the cluster. Clusters run within a VPC, with nodes spread across Availability Zones.

I'm not sure what the advantage is (compared with running memcached or a similar cache yourself); maybe it will click after I think it over for a few days... :o

EMR's Fix for S3 Consistency

In January this year, Netflix wrote about the problem of S3's eventual consistency, which I covered in "Netflix's Workaround for S3's Eventual Consistency...". At the time, Netflix's approach was to build s3mper to ensure consistency.

Half a year later, AWS implemented a similar feature in EMR: "Consistent View for Elastic MapReduce's File System".

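From the article's description, it appears to use DynamoDB to track the state of the data on S3, and the data in DynamoDB is not deleted automatically, so keep that in mind when using it :o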

Netflix's Workaround for S3's Eventual Consistency...

As everyone knows, Netflix runs almost all of its services on AWS, and that of course includes Amazon S3.

Amazon S3 has an eventual consistency problem: after a write you may read stale data, and then the computation runs on the wrong data...

Netflix discussed several options and ended up developing s3mper to deal with Amazon S3's eventual consistency problem: "S3mper: Consistency in the Cloud".

s3mper stores file metadata in Amazon DynamoDB and uses it to tell whether what S3 returns is consistent. Amazon DynamoDB itself is also eventually consistent by default, but its API lets you request a strongly consistent read.

In "Supported Operations in DynamoDB", the "Data Read and Consistency Considerations" section describes two read modes:

  • Eventually Consistent Reads
  • Strongly Consistent Reads

With Strongly Consistent Reads, you can be sure that what you read reflects the latest data. The job only continues when the data in DynamoDB and S3 agree...
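
The check described above can be sketched roughly as follows. This is not s3mper's actual code: consistent_list, ListingInconsistent and the two callables are hypothetical, and both listings are simulated in memory. In the real DynamoDB API, a strongly consistent read is requested by setting the ConsistentRead parameter on operations such as GetItem.

```python
class ListingInconsistent(Exception):
    """Raised when S3 keeps returning fewer objects than the metadata store expects."""


def consistent_list(list_s3, list_metadata, retries=3):
    """Return the S3 listing only once it contains every key the metadata knows about.

    list_s3 and list_metadata are hypothetical callables that return iterables of
    keys; list_metadata is assumed to be backed by strongly consistent reads.
    """
    expected = set(list_metadata())
    missing = expected
    for _ in range(retries):
        seen = set(list_s3())
        missing = expected - seen
        if not missing:
            return sorted(seen)
    raise ListingInconsistent(sorted(missing))


# Simulated S3 listing: the first response is stale, the second is complete.
responses = iter([["part-0000"], ["part-0000", "part-0001"]])
result = consistent_list(lambda: next(responses),
                         lambda: ["part-0000", "part-0001"])
```

If the listing never catches up within the retry budget, the sketch raises instead of letting the job continue on incomplete input.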

This solution amounts to putting a safety net on top of Amazon S3, so it's a workaround :p If Amazon S3 could provide consistency information itself, none of this would be necessary...