Amazon S3 變成 Strong Consistency 背後的改善方式

看到 Hacker News 上的討論「Diving Deep on S3 Consistency (」才想到該整理一下,原文的「Diving Deep on S3 Consistency」是 Amazon 的 CTO Werner Vogels 花了一些篇幅描述 Amazon S3 怎麼把 Eventually Consistent 變成 Strongly Consistent,當初 Amazon S3 公告時我也有寫一篇文章提到:「Amazon S3 現在變成 Strong Read-After-Write Consistency 啦...」。

Amazon S3 之所以會是 Eventually Consisient 是因為 Metadata Subsystem 的 cache 設計:

Per-object metadata is stored within a discrete S3 subsystem. This system is on the data path for GET, PUT, and DELETE requests, and is responsible for handling LIST and HEAD requests. At the core of this system is a persistence tier that stores metadata. Our persistence tier uses a caching technology that is designed to be highly resilient. S3 requests should still succeed even if infrastructure supporting the cache becomes impaired. This meant that, on rare occasions, writes might flow through one part of cache infrastructure while reads end up querying another. This was the primary source of S3’s eventual consistency.

如果要解決 Eventually Consistent,最直接的想法是拔掉 cache,但這樣對效能的影響太大,所以得在要保留 cache 的情況下設計,所以就想到用其他管道確保 cache 裡的資料狀態是正確的:

One early consideration for delivering strong consistency was to bypass our caching infrastructure and send requests directly to the persistence layer. But this wouldn’t meet our bar for no tradeoffs on performance. We needed to keep the cache. To keep values properly synchronized across cores, CPUs implement cache coherence protocols. And that’s what we needed here: a cache coherence protocol for our metadata caches that allowed strong consistency for all requests.

而接下來是設計一連串的邏輯確保每個 S3 object 的操作都有 serializability:

We had introduced new replication logic into our persistence tier that acts as a building block for our at-least-once event notification delivery system and our Replication Time Control feature. This new replication logic allows us to reason about the “order of operations” per-object in S3. This is the core piece of our cache coherency protocol.

後面又要確保這個 cache coherence 的 HA,最後要能夠驗證實做上的正確性,花的力氣比實做協定本身還多:

These verification techniques were a lot of work. They were more work, in fact, than the actual implementation itself. But we put this rigor into the design and implementation of S3’s strong consistency because that is what our customers need.

Amazon S3 算是 AWS 當初推出來的招牌,當時的 Amazon S3 底層的論文「Amazon's Dynamo」劇烈影響了後來整個產業 (雖然論文裡面是拿 Amazon 的購物車說明),這次的補充算是更新了原來論文的技術,告訴大家本來的 Eventually Consistent 是可以再拉到 Strongly Consistent。

Amazon EC2 提供跨區直接複製 AMI (Image) 的功能

Amazon EC2AMI 可以跨區複製了:「Amazon EC2 now allows you to copy Amazon Machine Images across AWS GovCloud, AWS China and other AWS Regions」。

如同公告提到的,在這個功能出來以前,想要產生一樣的 image 得重新在 build 一份:

Previously, to copy AMIs across these AWS regions, you had to rebuild the AMI in each of them. These partitions enabled data isolation but often made this copy process complex, time-consuming and expensive.

有一些限制,image 大小必須在 1TB 以下,另外需要存到 S3 上,不過這些限制應該是還好:

This feature provides a packaged format that allows AMIs of size 1TB or less to be stored in AWS Simple Storage Service (S3) and later moved to any other region.

然後目前只有透過 cli 操作的方式,或是直接用 SDK 呼叫 API,看起來 web console 還沒提供:

This functionality is available through the AWS Command Line Interface (AWS CLI) and the AWS Software Development Kit (AWS SDK). To learn more about copying AMIs across these partitions, please refer to the documentation.

EFS 上可以掛 AWS Transfer Family 了

先前 AWS Transfer Family 的後端只能是 Amazon S3,現在則是宣佈可以掛 Amazon EFS 了:「New – AWS Transfer Family support for Amazon Elastic File System」。

EFS 跟 S3 都是沒有空間限制,但 EFS 可以直接在系統上掛起來當作一般的檔案系統用,基本上就是更方便,不過代價就是單位儲存成本貴不少...

這次支援 EFS 對於一些量不大的處理又方便不少,也就是處理完後的檔案另外丟,而上傳上來的檔案可以砍掉的... 如果是上傳上來的檔案需要保留的,用 S3 會比較適合。

Amazon S3 現在變成 Strong Read-After-Write Consistency 啦...

看到 Amazon S3 宣佈 Strong Read-After-Write Consistency 了:「Amazon S3 Update – Strong Read-After-Write Consistency」。


所以到這次更新之前,只有新增的 object 會保證馬上出現。現在則是 update 也會:

Effective immediately, all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies.

要注意這邊沒有提到 DELETE,所以有可能 DELETE + GET 的操作還是沒有到 strong consistency,不過句子本身意思不是很清晰,也許這幾天會有人在 forum 上面問然後有答案...

另外從公告裡面提到 Amazon EMR 團隊,看起來是 Amazon EMR 團隊一直在內部戳 Amazon S3 的團隊改善:

We’ve been working with the Amazon EMR team and developers in the open-source community to ensure that customers can take advantage of this update with their big data workloads. As a result of that you no longer need to use EMRFS Consistent View or S3Guard, further reducing the cost to run big data workloads in AWS.

AWS 推出了 Amazon S3 Storage Lens 可以看 S3 使用的概況

AWS 推出了 Amazon S3 Storage Lens,可以看 S3 使用的概況:「Introducing Amazon S3 Storage Lens – Organization-wide Visibility Into Object Storage」。

要使用者個功能需要授權 Amazon S3 Storage Lens 一些權限,照著說明去 IAM 開就可以了,開好後要等他一陣子,他需要去分析記錄才能產出 dashboard。

有免費版與付費版可以用,付費版的部份目前看到都是「$0.20 per million objects monitored per month」,但沒把所有的區域都翻完,所以不確定。

我自己看了一下免費版提供的預設 dashboard,就已經給出不少好用的資訊了,像是 30 天內的物件數與空間使用率變化,可以抓到一些成長數量的感覺。


MariaDB 的 S3 Engine 效能測試

PerconaMariaDB 在 10.5 (目前的最新穩定版) 裡出的 S3 Engine 給出了簡單的測試報告:「MariaDB S3 Engine: Implementation and Benchmarking」。

這個 engine 顧名思義就是把資料丟到 Amazon S3 上,目前是 alpha 版本,預設是不會載入的,需要開 alpha flag 才能用:

The S3 engine is READ_ONLY so you can’t perform any write operations ( INSERT/UPDATE/DELETE ), but you can change the table structure.

另外這是從 Aria 改出來的 read-only engine,而 Aria 是從 MyISAM 改出來的:

The S3 storage engine is based on the Aria code and the main feature is that you can directly move your table from a local device to S3 using ALTER.

測出來發現在 read-only 的情境下,COUNT(*) 超快,看起來就是跟 MyISAM 體系有關,直接撈 MyISAM 內的資料,所以本地要 18 秒,但放到 S3 反而秒殺 XDDD

整體看起來還不錯?算是一種 Data warehouse 的方案,主要是要用到 row-based format 儲存的優點,遇到一些冷資料可以這樣玩。

從「Using the S3 Storage Engine」這邊的設定方式看到 s3_host_name,看起來有機會接其他家的 S3 API,或是本地的 Storage。

話說 Aria 這個引擎當初最主要的重點就在 crash-safe,在有了 crash-safe 之後,DRBD 這種 block-level replication 機制就可以硬幹上去,後來主力就在擴充其他型態了,像是 GIS 與 virtual column 的功能,不過這些功能本家在 InnoDB 上好像也都陸陸續續跟上來了,單純的 Aria engine 好像還好...

Backblaze B2 支援相容 Amazon S3 的 API

Backblaze 宣佈支援相容 Amazon S3 的 API:「Backblaze B2 Cloud Storage Now Has S3 Compatible APIs」。

Amazon S3 的 API 算是 object storage 這個領域的 de facto standard 了,支援 Amazon S3 相容層可以讓現有的工具直接套用上去。

很多 client 軟體都藉著設定 API endpoint 的方式來支援 (通常預設會是 Amazon S3 的,這次的 endpoint 可以從 B2 的文件「S3 Compatible API」裡看到:

The format for endpoints for the Backblaze S3 Compatible API:


The Backblaze S3 Compatible API endpoints only accept connections over HTTPS. Non-secure connections will be rejected. The AWS SDKs and most integrations only require an Endpoint URL like the above (without the bucket name included).

另外也支援使用 bucket name 的形式操作:

If making the HTTP calls directly, the Backblaze S3 Compatible API supports specifying the bucket name in the hostname of the URL or in the path section of the URL. Both URLs below are valid examples of an endpoint calling a bucket:

B2 的另外一個優勢是 2018 的時候就跟 Cloudflare 合作 (參考「Backblaze 與 Cloudflare 合作,免除傳輸費用」),從 B2 到 Cloudflare 的流量是不收費的,再加上 Cloudflare 的流量也可以是免費的,組合起來就變成一個很便宜的方案 (只有 B2 的 storage cost)。

Amazon Elasticsearch Service 可以利用 S3 當作二級儲存空間了

Amazon Elasticsearch Service 的新功能,使用 Amazon S3 當作第二級儲存空間 (UltraWarm):「Announcing UltraWarm (Preview) for Amazon Elasticsearch Service」。

UltraWarm 需要不同的機器 (跑不同版本?),機器的規格 (vCPU 與記憶體的比率) 接近 Memory Optimized 的版本,但是貴了不少,所以需要夠大的資料量才會打平回來...

us-east-1 來看,SSD EBS 的空間成本就是 USD$0.135/GB,而傳統磁性硬碟是 USD$0.067/GB (不知道收不收 I/O 費用?),但 storage 的價錢是 USD$0.024/GB。這邊值得一提的是 Amazon S3 是 USD$0.023/GB,看起來是直接包括了 API 的呼叫費用?

Amazon S3 的 Replication 也給出 SLA 了

Amazon S3 的 cross-region replication 與 same-region replication 也提供 SLA 了:「S3 Replication Update: Replication SLA, Metrics, and Events」。

  • Most of the objects will be replicated within seconds.
  • 99% of the objects will be replicated within 5 minutes.
  • 99.99% of the objects will be replicated within 15 minutes.


When you enable this feature, you benefit from the associated Service Level Agreement. The SLA is expressed in terms of a percentage of objects that are expected to be replicated within 15 minutes, and provides for billing credits if the SLA is not met:

  • 99.9% to 98.0% – 10% credit
  • 98.0% to 95.0% – 25% credit
  • 95% to 0% – 100% credit

不過只保證 99% 的物件在五分鐘內會被 replicate 有點低,應該跟底層的網路 latency 有關?

Amazon S3 推出同一個區域的同步複製功能

Amazon S3 推出了 Same-Region Replication:「Amazon S3 introduces Same-Region Replication」。

先前的功能只有 Cross-Region Replication,可以當作異地備份的功能,現在則是推出讓同一區也可以複製...


Replicated objects can be owned by the same AWS account as the original copy or by different accounts, to protect from accidental deletion.