The Improvements Behind Amazon S3's Move to Strong Consistency

Seeing the Hacker News discussion "Diving Deep on S3 Consistency (allthingsdistributed.com)" reminded me that I should write this up. The original post, "Diving Deep on S3 Consistency", is by Amazon CTO Werner Vogels, who spends some length describing how Amazon S3 went from eventually consistent to strongly consistent. I also wrote a post when Amazon S3 first announced it: "Amazon S3 Now Has Strong Read-After-Write Consistency...".

Amazon S3 was eventually consistent because of the cache design in its metadata subsystem:

Per-object metadata is stored within a discrete S3 subsystem. This system is on the data path for GET, PUT, and DELETE requests, and is responsible for handling LIST and HEAD requests. At the core of this system is a persistence tier that stores metadata. Our persistence tier uses a caching technology that is designed to be highly resilient. S3 requests should still succeed even if infrastructure supporting the cache becomes impaired. This meant that, on rare occasions, writes might flow through one part of cache infrastructure while reads end up querying another. This was the primary source of S3’s eventual consistency.
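To make that failure mode concrete, here's a minimal toy model (my own sketch, not S3's actual architecture): two cache nodes sit in front of one persistence tier, a write flows through one node, and a read served by the other node returns stale metadata.

```python
# Toy model of the eventual-consistency window described above: two
# independent cache nodes in front of one persistence tier. Not S3's
# real architecture, just an illustration of the failure mode.

class PersistenceTier:
    def __init__(self):
        self.store = {}

class CacheNode:
    def __init__(self, backend):
        self.backend = backend
        self.cache = {}

    def put(self, key, value):
        self.backend.store[key] = value
        self.cache[key] = value          # only THIS node's cache is updated

    def get(self, key):
        if key not in self.cache:        # cold: fall through to the backend
            self.cache[key] = self.backend.store.get(key)
        return self.cache[key]           # warm: may be stale

backend = PersistenceTier()
node_a, node_b = CacheNode(backend), CacheNode(backend)

node_a.put("obj", "v1")
node_b.get("obj")        # warms node_b's cache with "v1"
node_a.put("obj", "v2")  # the write flows through node_a only
print(node_b.get("obj")) # "v1" -- a stale read, despite the durable "v2"
```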

The most direct way to fix eventual consistency would be to rip out the cache, but the performance hit would be too big, so the design had to keep the cache. The idea, then, was to use a separate channel to make sure the data in the cache is in the correct state:

One early consideration for delivering strong consistency was to bypass our caching infrastructure and send requests directly to the persistence layer. But this wouldn’t meet our bar for no tradeoffs on performance. We needed to keep the cache. To keep values properly synchronized across cores, CPUs implement cache coherence protocols. And that’s what we needed here: a cache coherence protocol for our metadata caches that allowed strong consistency for all requests.

Next came designing a series of mechanisms to ensure that operations on each S3 object are serializable:

We had introduced new replication logic into our persistence tier that acts as a building block for our at-least-once event notification delivery system and our Replication Time Control feature. This new replication logic allows us to reason about the “order of operations” per-object in S3. This is the core piece of our cache coherency protocol.
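The post doesn't spell the protocol out, but the building block it names, a per-object "order of operations", can be sketched roughly like this: every write gets a monotonically increasing sequence number per object, and a cache entry may only be served if it matches the latest sequence number the replication log knows about. All the names and structure below are my guesses for illustration, not AWS's implementation.

```python
import itertools

# Hypothetical sketch of a per-object "order of operations" used as a
# cache-coherence check; names and structure here are mine, not AWS's.

class ReplicationLog:
    """Authoritative per-object sequence numbers (the 'order of operations')."""

    def __init__(self):
        self.seq = {}
        self.counter = itertools.count(1)

    def record_write(self, key):
        self.seq[key] = next(self.counter)
        return self.seq[key]

    def latest(self, key):
        return self.seq.get(key, 0)


class CoherentCache:
    def __init__(self, log, backend):
        self.log = log
        self.backend = backend   # persistence tier: key -> (seq, value)
        self.cache = {}          # local cache:      key -> (seq, value)

    def put(self, key, value):
        seq = self.log.record_write(key)
        self.backend[key] = (seq, value)
        self.cache[key] = (seq, value)

    def get(self, key):
        latest = self.log.latest(key)
        cached = self.cache.get(key)
        if cached is None or cached[0] < latest:
            # Entry is missing or stale: refresh from the persistence tier
            # instead of serving the outdated cached value.
            self.cache[key] = self.backend[key]
        return self.cache[key][1]


# Two cache nodes over one backend: node_b notices its entry is stale
# because the replication log says a newer write exists.
log, backend = ReplicationLog(), {}
node_a, node_b = CoherentCache(log, backend), CoherentCache(log, backend)

node_a.put("obj", "v1")
assert node_b.get("obj") == "v1"
node_a.put("obj", "v2")           # the write flows through node_a only
assert node_b.get("obj") == "v2"  # stale entry detected and refreshed
```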

After that they had to make this cache coherence mechanism highly available, and finally verify the correctness of the implementation, which took more effort than implementing the protocol itself:

These verification techniques were a lot of work. They were more work, in fact, than the actual implementation itself. But we put this rigor into the design and implementation of S3’s strong consistency because that is what our customers need.

Amazon S3 was one of AWS's signature products at launch, and "Amazon's Dynamo", the paper behind the original S3's underpinnings, had a huge influence on the whole industry afterwards (even though the paper uses Amazon's shopping cart as its example). This follow-up effectively updates the techniques in the original paper, showing everyone that the original eventual consistency can be pushed up to strong consistency.

Akamai Also Launches a Key-Value Service, EdgeKV

I haven't covered Akamai's architecture here before. To start with, Akamai's serverless platform at the edge is EdgeWorkers, which runs JavaScript:

EdgeWorkers lets developers just code — integrating into existing CI/CD workflows and enabling multiple teams to work in parallel using JavaScript. EdgeWorkers eliminates the hassle of managing compute resources and building for scale.

And what's being launched now is EdgeKV, currently in beta: "Serverless Storage at the Edge (EdgeKV Beta)".

As the name says, it's a key-value design; it gives up the C in the CAP theorem and goes with eventual consistency instead:

EdgeKV uses what is known in distributed computing as an eventual consistency model to perform writes and updates. This model achieves high availability with low read latency by propagating data writes globally. The period of time it takes the system to distribute data globally is called the “inconsistency window”.

Next door, Cloudflare Workers KV is also eventually consistent (from the "How KV works" page):

KV achieves this performance by being eventually-consistent. Changes are immediately visible in the edge location at which they're made, but may take up to 60 seconds to propagate to all other edge locations.
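In application terms both products offer the same contract: a write is visible immediately at the location that made it, while reads elsewhere may return the old value until the inconsistency window closes. A toy model of that contract (my own sketch, not either vendor's API):

```python
# Toy model of an edge KV "inconsistency window": a write is visible
# immediately at the origin location, while other locations keep serving
# the old value until propagation completes. Not either vendor's API.

INCONSISTENCY_WINDOW = 60  # seconds; Workers KV documents "up to 60 seconds"

class EdgeKVModel:
    def __init__(self, locations):
        self.locations = {name: {} for name in locations}
        self.pending = []  # (apply_at, key, value, origin)

    def put(self, origin, key, value, now):
        self.locations[origin][key] = value  # visible at the origin at once
        self.pending.append((now + INCONSISTENCY_WINDOW, key, value, origin))

    def get(self, location, key, now):
        for apply_at, k, v, origin in self.pending:
            if now >= apply_at and location != origin:
                self.locations[location][k] = v  # propagation has finished
        return self.locations[location].get(key)

kv = EdgeKVModel(["iad", "nrt"])
kv.put("iad", "greeting", "hello", now=0)
print(kv.get("iad", "greeting", now=1))   # "hello" -- origin sees it at once
print(kv.get("nrt", "greeting", now=1))   # None -- still inside the window
print(kv.get("nrt", "greeting", now=61))  # "hello" -- after propagation
```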

This looks like it fills out a product line to match the competitor's...

Amazon S3 Now Has Strong Read-After-Write Consistency...

Amazon S3 has announced strong read-after-write consistency: "Amazon S3 Update – Strong Read-After-Write Consistency".

This problem had been discussed for a long time before this.

So until this update, only newly added objects were guaranteed to appear immediately; now updates are covered too:

Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies.
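No code changes are needed to benefit; a plain PUT followed by a GET on the same key now always returns what was just written. A minimal check with boto3 ("my-bucket" is a placeholder for a bucket you own):

```python
import boto3

# Minimal demonstration of read-after-write on S3; "my-bucket" is a
# placeholder. No special flag is needed: since this announcement every
# GET/PUT/LIST is strongly consistent.

s3 = boto3.client("s3")
bucket, key = "my-bucket", "consistency-demo.txt"

s3.put_object(Bucket=bucket, Key=key, Body=b"v1")
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"v1"

s3.put_object(Bucket=bucket, Key=key, Body=b"v2")  # overwrite the object
# Before the change this GET could still return b"v1"; now it cannot.
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"v2"

# LIST is also strongly consistent now: the new key shows up immediately.
keys = [o["Key"] for o in s3.list_objects_v2(Bucket=bucket)["Contents"]]
assert key in keys
```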

Note that DELETE isn't mentioned here, so a DELETE + GET sequence may still not be strongly consistent. The wording itself isn't very clear; maybe someone will ask on the forums in the next few days and we'll get an answer...

The announcement also mentions the Amazon EMR team; it looks like the EMR team had been prodding the S3 team internally to improve this:

We’ve been working with the Amazon EMR team and developers in the open-source community to ensure that customers can take advantage of this update with their big data workloads. As a result of that you no longer need to use EMRFS Consistent View or S3Guard, further reducing the cost to run big data workloads in AWS.

CCFS on ext4

Something I came across in "Application crash consistency and performance with CCFS".

CCFS aims to raise data integrity on ext4 while still delivering high performance:

CCFS (the Crash-Consistent File System) is an extension to ext4 that restores ordering and weak atomicity guarantees for applications, while at the same time delivering much improved performance.

If you need absolute data integrity, you need data=journal so that data can be replayed after a system crash; the default data=ordered cannot provide that. CCFS doesn't aim for absolute data integrity either, just to get as close as it can.
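For a sense of what "ordering and weak atomicity guarantees for applications" buys you, here is the standard pattern an application has to do by hand today when the file system promises no ordering (plain POSIX, nothing CCFS-specific; CCFS's observation is that many applications skip these fsync calls and silently rely on ordering that data=ordered doesn't promise):

```python
import os

# The classic crash-safe update pattern: write a temp file, flush it to
# disk, then atomically rename over the target. Plain POSIX, nothing
# CCFS-specific.

def atomic_update(path, data: bytes):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)        # force the temp file's data to disk first...
    finally:
        os.close(fd)
    os.rename(tmp, path)    # ...then atomically replace the target

    # fsync the directory so the rename itself survives a crash
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

atomic_update("config.json", b'{"version": 2}')
```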

The tests show that CCFS greatly improved application crash consistency, and performance went up as well (wait, what?).

That's pretty amazing...

I looked around and it doesn't seem to be open sourced (at least I can't find it right now), so let's wait and see if anyone implements it...

Google's Cloud Spanner

Google is putting Cloud Spanner up for sale as a service: "Introducing Cloud Spanner: a global database service for mission-critical applications", along with the explanation in "Inside Cloud Spanner and the CAP Theorem".

Cloud Spanner is designed to have RDBMS capabilities (such as ACID properties) along with strong scalability and availability:

Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.

The explanation notes that Cloud Spanner lands on the CP side of the CAP theorem:

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

And then pushes A so high that users don't worry about its downtime:

However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages.

Of course, the more controversial part was the tweet from the official Google Cloud account on Twitter, which flat-out claimed Spanner satisfies all three CAP properties at once.

The price isn't cheap, but at least people shopping for a solution have an option now...

Improvements to Amazon S3

"Amazon S3 Introduces New Usability Enhancements" mentions two improvements to Amazon S3.

The first is a business-side improvement: previously you presumably had to open a support ticket to raise the S3 bucket limit, whereas now you can request it directly through the console? (I've never hit that limit, so I don't know whether it simply wasn't visible in the console before...)

The second is the main event: read-after-write consistency.

With this enhancement, Amazon S3 now supports read-after-write consistency in all regions for new objects added to Amazon S3.

In other words, Amazon S3 now guarantees that a newly added object can be read immediately after it is created. How serious was this problem before the fix? See the issues Netflix hit while running Pig and Hive back in 2014, in "Netflix's Workaround for S3's Eventual Consistency...".

Netflix's example has two Pig clusters, where Pig-2 needs the data produced by Pig-1. Before this announcement, if the data Pig-1 wrote back to Amazon S3 didn't show up immediately, Pig-2 would run on incomplete data:

Pig-2 is activated based on the completion of Pig-1 and immediately lists the output directories of the previous task. If the S3 listing is incomplete when the second job starts, it will proceed with incomplete data.

Now that new objects are finally guaranteed to be readable right away, Netflix can use a single file listing all the filenames to make sure it knows every file name... (the LIST operation was still eventually consistent at this point, so that part still had to be handled on your own). A sketch of this manifest idea follows.
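This is my own illustration, not Netflix's code: the producer writes one extra object naming every output key, and the consumer fetches exactly those keys instead of trusting LIST. Bucket and key names below are placeholders.

```python
import boto3

# Sketch of the manifest-file idea: because newly created objects are
# read-after-write consistent, the producer can publish one object naming
# all of its outputs, and the consumer fetches exactly those keys instead
# of relying on an eventually consistent LIST.

s3 = boto3.client("s3")
bucket = "my-etl-bucket"  # placeholder

def publish_outputs(prefix, parts):
    keys = []
    for i, data in enumerate(parts):
        key = f"{prefix}/part-{i:05d}"
        s3.put_object(Bucket=bucket, Key=key, Body=data)
        keys.append(key)
    # The manifest is written last, once every part is durable.
    manifest = "\n".join(keys).encode()
    s3.put_object(Bucket=bucket, Key=f"{prefix}/_manifest", Body=manifest)

def read_outputs(prefix):
    manifest = s3.get_object(Bucket=bucket, Key=f"{prefix}/_manifest")
    keys = manifest["Body"].read().decode().splitlines()
    # Each GET targets a known new object, so it cannot miss data the way
    # an incomplete LIST could.
    return [s3.get_object(Bucket=bucket, Key=k)["Body"].read() for k in keys]
```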

EMR's Workaround for S3 Consistency

Back in January this year, Netflix wrote about the problems with S3's eventual consistency: "Netflix's Workaround for S3's Eventual Consistency..."; Netflix's approach at the time was to implement s3mper to ensure consistency.

Half a year later, the AWS folks implemented similar functionality in EMR: "Consistent View for Elastic MapReduce's File System".

From the article's description, it appears to use DynamoDB to track the state of the data on S3, and the DynamoDB data doesn't get cleaned up automatically, so keep that in mind when using it :o

Netflix's Workaround for S3's Eventual Consistency...

As everyone knows, Netflix runs almost all of its services on AWS, and that of course includes Amazon S3.

Amazon S3 has an eventual consistency problem: after a write, you may read back stale data, and then your computation runs on the wrong data...

The Netflix folks discussed several approaches and ended up developing s3mper to deal with Amazon S3's eventual consistency problem: "S3mper: Consistency in the Cloud".

s3mper stores file metadata in AWS DynamoDB and uses it to tell whether the view is consistent. Although Amazon DynamoDB is itself eventually consistent by default, it offers an API option that lets you ask for a consistent read.

In "Supported Operations in DynamoDB", the "Data Read and Consistency Considerations" section describes two read modes:

  • Eventually Consistent Reads
  • Strongly Consistent Reads

With strongly consistent reads, you can confirm whether what you read is the latest data. The job only proceeds when both the DynamoDB and the S3 data check out...
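The knob in question is a per-request flag. s3mper itself is a Java library, but the same idea in boto3 looks like this (table and key names are placeholders):

```python
import boto3

# DynamoDB's per-request consistency switch; the table and key names are
# placeholders, not s3mper's actual schema.

dynamodb = boto3.client("dynamodb")

# Default: eventually consistent read -- cheaper, but may lag a recent write.
item = dynamodb.get_item(
    TableName="s3-file-metadata",
    Key={"path": {"S": "s3://bucket/dir/part-00000"}},
)

# ConsistentRead=True: strongly consistent read -- reflects all writes that
# completed before the request, which is what the consistency check needs.
item = dynamodb.get_item(
    TableName="s3-file-metadata",
    Key={"path": {"S": "s3://bucket/dir/part-00000"}},
    ConsistentRead=True,
)
```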

This solution amounts to putting a safety net on top of Amazon S3; call it a workaround :p If Amazon S3 exposed consistency information itself, none of this would be needed...