Amazon EFS 效能提昇的一些討論

上一篇「Amazon EFS 的效能提昇」提到 Amazon EFS 的效能提昇,在 Hacker News 上看到 Amazon EFS 團隊的 PMT (Product-Manager-Technical) 出來回一些東西:「Amazon Elastic File System Update – Sub-Millisecond Read Latency (」,搜尋 geertj 應該就可以看到他回的東西了...

像是即使是 Jeff Barr 發表這篇文章,也還是經過 legal team 的同意才能發表:

(PMT on the EFS team).

Yes, the wordings are carefully formulated as they have to be signed off by the AWS legal team for obvious reasons. With that said, this update was driven by profiling real applications and addressing the most common operations, so the benefits are real. For example, a simple WordPress "hello world" is now about 2x as fast as before.

另外這次的效能提昇是透過 cache 層達成的:

I'm the PMT for this project in the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache. You can enable such a cache in multiple steps. First you deploy the software across the entire stack that supports the caching protocol but it's disabled by configuration. Then you turn it for the multiple components that are involved in the right order. Another thing that was hard to get right was to ensure that there are no performance regressions due to the consistency protocol.

然後在每個 AZ 都有 cache:

The caches are local to each AZ so you get the low latency in each AZ, the other details are different. Unfortunately I can't share additional details at this moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!

另外看起來主要就是 metadata cache 的幫助:

NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.

還是有很多細節數字不能透漏,但知道是透過 cache 達成的就已經可以大致上想像後面是怎麼弄出來的了...

AWS App Runner 總算可以存取 VPC 內的資源了

算是上個星期的消息了,App Runner 這個產品剛出來的時候無法連到 VPC 內的資源,不知道要怎麼用,現在總算是把這個功能補上了:「New for App Runner – VPC Support」。

不過還是不看好,旁邊還有 AWS Elastic BeanstalkAWS Amplify 同質性超高的服務,都是只寫 code 丟上去就能跑:

AWS App Runner is a fully managed service that makes it easy for developers to quickly deploy containerized web applications and APIs, at scale and with no prior infrastructure experience required. Start with your source code or a container image. App Runner builds and deploys the web application automatically, load balances traffic with encryption, scales to meet your traffic needs, and makes it easy for your services to communicate with other AWS services and applications that run in a private Amazon VPC. With App Runner, rather than thinking about servers or scaling, you have more time to focus on your applications.

AWS Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications and services developed with Java, .NET, PHP, Node.js, Python, Ruby, Go, and Docker on familiar servers such as Apache, Nginx, Passenger, and IIS.

You can simply upload your code and Elastic Beanstalk automatically handles the deployment, from capacity provisioning, load balancing, auto-scaling to application health monitoring. At the same time, you retain full control over the AWS resources powering your application and can access the underlying resources at any time.

AWS Amplify is a set of purpose-built tools and features that lets frontend web and mobile developers quickly and easily build full-stack applications on AWS, with the flexibility to leverage the breadth of AWS services as your use cases evolve. With Amplify, you can configure a web or mobile app backend, connect your app in minutes, visually build a web frontend UI, and easily manage app content outside the AWS console. Ship faster and scale effortlessly—with no cloud expertise needed.

更不用說旁邊還有 Lambda 類的架構...

Amazon EC2 上的一些小常識

Twitter 上看到 Laravel News 轉發了「Mistakes I've Made in AWS」這篇,講 Amazon EC2 上面的一些小常識。

在 EC2 中,T 系列的機器 (目前主要是 t2/t3/t3a/t4g) 對於開發很好用,甚至對於量還不大的 production system 也很好用,加上 Unlimited 模式可以讓你在 CPU credit 用完時付錢繼續 burst。

文章裡面有討論到,使用 T 系列機器時,常常是不怎麼需要大量 CPU 資源的情境,這時候 AMD-based 的 t3a 通常都是個還不錯的選擇,大概會比 Intel-based 的 t3 省 10% 的費用。另外如果可以接受 ARM-based 的話,t4g 也是個選項,價錢會更便宜而且在很多應用下速度會更快。不過同事有遇到 Python 上面跑起來的行為跟 x86-64-based 的不同,這點就得自己琢磨了...

另外就是目前的 EBS 預設還是會使用 gp2,而在 gp3 出來後其實大多數的情況下應該可以換過去,主要就是便宜了 20%,加上固定的 3000 IOPS。

不過也是有些情境下是不應該換的,主要是 gp2 可以 burst 到 250MB/sec,但 gp3 只給了 125MB/sec。雖然 gp3 可以加價買 throughput,但加價的費用不低,這種需求改用 gp2 應該會比較划算。

不過這邊推薦比較技術的作法,可以掛兩個 gp3 (也可以更多) 跑 RAID0 (像是在 Linux 上可以透過 mdadm 操作),這樣 IOPS 與 throughput 都應該可以拉上來...

MySQL 在不同種類 EBS 上的效能

Percona 的人寫了一篇關於 MySQL 跑在 AWS 上不同種類 EBS 的效能差異:「Performance of Various EBS Storage Types in AWS」,不過這篇的描述部份不是很專業,重點是直接看測試資料建立自己的理解。

他的方法是在 AWS 上建立了相同參數的 gp2gp3io1io2 空間,都是 1TB 與 3000 IOPS,但他提到這應該會一樣:

So, all the volumes are 1TB with 3000 iops, so in theory, they are the same.

但這在「Amazon EBS volume types」文件上其實都有提過了,先不管 durability 的部份,光是與效能有關的規格就不一樣了。

在 gp2 的部份直接有提到只有保證 99% 的時間可以達到宣稱的效能:

AWS designs gp2 volumes to deliver their provisioned performance 99% of the time.

而 gp3 則是只用行銷宣稱「consistent baseline rate」,連 99% 都不保證:

These volumes deliver a consistent baseline rate of 3,000 IOPS and 125 MiB/s, included with the price of storage.

io* 的部份則是保證 99.9%:

Provisioned IOPS SSD volumes use a consistent IOPS rate, which you specify when you create the volume, and Amazon EBS delivers the provisioned performance 99.9 percent of the time.

另外在測試中 gp2gp3 的 throughput 看起來也沒調整成一樣的數字。在 1TB 的 gp2 中會給 250MB/sec 的速度,1TB 的 gp3 則是給 125MB/sec,除非你有加買 throughput。

另外從這句也可以看出來他對 AWS 不熟:

The tests were only run in a single availability zone (eu-west-1a).

在「AZ IDs for your AWS resources」這邊有提過不同帳號之間,同樣代碼的 AZ 不一定是一樣的區域,需要看 AZ ID:

For example, the Availability Zone us-east-1a for your AWS account might not have the same location as us-east-1a for another AWS account.

To identify the location of your resources relative to your accounts, you must use the AZ ID, which is a unique and consistent identifier for an Availability Zone. For example, use1-az1 is an AZ ID for the us-east-1 Region and it is the same location in every AWS account.

在考慮到只有設定大小與 IOPS 的情況下,剩下的測試結果其實跟預期的差不多:io2 貴但是可以得到最好的效能,io1 的品質會差一些,gp3 在大多數的情況下其實很夠用,但要注意預設的 throughput 沒有 gp2 高。

Elasticsearch 的 Python 套件開始阻擋 OpenSearch 的伺服器了

Hacker News Daily 上看到的:「Official Elasticsearch Python library no longer works with open-source forks (」,連結所指向的是 GitHub 上的 pull request,在「Verify connection to Elasticsearch #1623」這邊。

講白了也就是 Elasticsearch 官方的 Python client 開始阻擋 AWS 主推的 OpenSearch

另外 AWS 這邊也出手,把本來的 client 都 fork 出來:「Keeping clients of OpenSearch and Elasticsearch compatible with open source」,這場戰爭還有得打...

AWS 宣佈 EBS io2 的新花樣 Block Express Volumes

看到「AWS Announces General Availability of Amazon EBS io2 Block Express Volumes」這篇,在 EBSio2 上面又推出了新的花樣 Block Express Volumes:

Today AWS announced general availability of io2 Block Express volumes that deliver up to 4x higher throughput, IOPS, and capacity than io2 volumes, and are designed to deliver sub-millisecond latency and 99.999% durability.

要再提供更高的效能,在 R5b 的機種下,單個 volume 可以拉到 256k IOPS 與 4000MB/sec 的傳輸速度,以及在 well-tuned 的環境下 (應該是多個 volume) 可以拉到 260k IOPS (多一點點) 與 7500MB/sec (將近原來的兩倍) 的傳輸速度:

Using R5b instances customers can now provision a single io2 volume with up to 256,000 IOPS, 4000 MB/s of throughput, and storage capacity of 64 TiB.

R5b instances are well-suited to run business-critical and storage-intensive applications as they offer the highest EBS-optimized performance of up to 260,000 IOPS and 7,500 MB/s throughput.


Amazon ECS Anywhere

在去年年底 AWS 的公佈的「re:Invent 2020 – Preannouncements for Tuesday, December 1」裡面提到兩個有趣的產品,一個是 Amazon ECS Anywhere,另外一個是 Amazon EKS Anywhere,現在 Amazon ECS Anywhere 開放了:「Getting Started with Amazon ECS Anywhere – Now Generally Available」。

這兩個服務都是把自家的機器 container 化然後讓 AWS 的服務直接管理,只是一個是 ECS (AWS 自家的規格),另外一個是 EKS (基於 Kubernetes),這次丟出來當然很重要,不過還是會等 EKS Anywhere 出來後一起比較看看。

價錢的部份就是照機器數量算,一台機器大約 USD$7.38/month,以 bare metal 等級的機器來說倒是沒什麼問題:

You pay $0.01025 per hour for each managed ECS Anywhere on-premises instance. An on-premises instance is a customer-managed instance that has been registered with an Amazon ECS cluster and runs the Amazon ECS container agent.

這樣讓地端的機器更容易上雲,不過離台灣本地沒有 region 在網路的 latency 上就有點討厭了,另外一種搞法是找 dedicated hosting 或是自己塞機器進 colocation hosting,然後掛上這類服務?

AWS 跳出來決定繼續搞 Elasticsearch 了

先前提到「Elasticsearch 與 Kibana 也變成非 Open Source 軟體」,後來 Elastic 的 CEO (創辦人) 發了一篇「Amazon: NOT OK - why we had to change Elastic licensing」直接批評 AWS

接下來是 AWS 跳出來放話了,基本上也是個新聞稿:「Stepping up for a truly open source Elasticsearch」,大概就是會繼續維護自己的版本,維持本來的 Apache License, Version 2.0,然後批評 Elastic 所說的話不實之類的...


gp3 (Amazon EBS) 的 latency

昨天把手上所有的 Amazon EBSgp2 換到 gp3 了:「Amazon EBS 的 gp3 可以用在開機磁碟了」,今天早上來看一下狀態,整體看起來是還 OK,不過有些地方值得注意的,像是標題寫到的 latency。

我抓了跑 GitLab 的機器來看,可以很明顯看到讀寫的 latency 都變高了:

AWS 又有提到這些數字資料有經過轉換,看起來是 gp2gp3 的數字意義本來就不一樣,所以他必須想辦法轉換,所以也有可能是因為這個轉換導致的?

This graph has had transformations applied to it and will differ from what is natively found in CloudWatch. Due to this some functionality is reduced.


AWS 推出 Amazon Elastic Container Registry Public (公開版的 ECR)

算是延伸產品線,把 Amazon ECR 變成可以公開使用:「Amazon Elastic Container Registry Public: A New Public Container Registry」。

這篇稍微有趣的地方是,文章裡面的上面這張圖有把 path 模糊化,但下面那張沒有遮,後面的文字也直接有提到 path (這是要給使用者玩的...):

ECR Public 會自動同步到兩個 region,但設定的頁面上好像沒寫會怎麼挑... 另外前面會放 CloudFront 加速。

ECR Public automatically replicates container images across two AWS Regions to reduce download times and improve availability. Therefore, using public images directly from ECR Public may simplify your build process if you were previously creating and managing local copies. ECR Public caches image layers in Amazon CloudFront, to improve pull performance for a global audience, especially for popular images.