EC2 Spot Instance 價錢的上漲趨勢

在「Farewell to the Era of Cheap EC2 Spot Instances」這邊討論了 Amazon EC2spot instance 最近有上漲的趨勢,像是這張應該是從 web console 拉出來 us-east-1t4g.nano 趨勢:

有不少 region 都有類似的情況,尤其是最常用的 us-east-1us-west-2

上個月 Plurk 的朋友也有聊到類似的情況,在 us-east-1 上愈來愈難找到便宜的 spot instance 機器了,當時還在想是不是有什麼大型活動,但文章出來後才發現大家都有遇到類似的情況。

另外在 Hacker News 上面也有討論:「Farewell to the Era of Cheap EC2 Spot Instances (pauley.me)」,裡面是有提到了一些工具可以再更彈性的調整,用更多邏輯改善成本,像是 AutoSpotting - Community Edition 這個專案用 lambda 幫你調整:

The entire logic described above is implemented in a set of Lambda functions deployed using CloudFormation or Terraform stacks that can be installed and configured in just a few minutes.

回頭來看一下目前的情況 (以及猜測 AWS 的策略),如果 spot instance 的常態價錢維持在牌價的六七成,等於是逼你規劃用 Savings Plans 之類的方案,然後讓 spot instance 慢慢退場。

話說回來,接下來不知道會不會有人去告 90% saving 的廣告宣傳...

AWS 推出了 Graviton3 的機種

Amazon EC2 推出了 Graviton3 的機種:「New Graviton3-Based General Purpose (m7g) and Memory-Optimized (r7g) Amazon EC2 Instances」。

第一波只有一般的 m7g 與記憶體型的 r7g,而計算型的 c7g 大家在 Twitter 上猜應該晚點會放出消息。在去年五月就推出了:「AWS 推出 c7g 機種」。

目前只在歐美的 us-east-1us-east-2us-west-2eu-west-1 區提供,亞洲目前都還沒有這些機種可以用:

M7g and R7g instances are available today in the US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland) AWS Regions in On-Demand, Spot, Reserved Instance, and Savings Plan form.

官方宣稱比 Graviton2 的 m6g & r6g 多了 25% 的效能,不過我另外查了一下 us-east-1 上的價錢,也貴了 6% 左右,如果依照官方宣稱的數字計算,大約是 18% 左右的 CP 值提昇,對於有實際上跑滿的 CPU 的人是個不錯的效能提昇:

Today I am happy to tell you about the newest Amazon EC2 instance types, the M7g and the R7g. Both types are powered by the latest generation AWS Graviton3 processors, and are designed to deliver up to 25% better performance than the equivalent sixth-generation (M6g and R6g) instances, making them the best performers in EC2.

裡面有提到在 Graviton3 的一個架構上的大改變是記憶體從 DDR4 變到 DDR5,這使得記憶體的傳輸頻寬提昇了 50%:

Both types of instances are equipped with DDR5 memory, which provides up to 50% higher memory bandwidth than the DDR4 memory used in previous generations.

接下來是看有沒有下放到 t 系列的計畫,像是 t5g 之類的,有的話再用看看好了,不過 blog 這台已經買了三年 RI,等到期間滿了之後說不定都有 Graviton4 或是 Graviton5 了...

Amazon EC2 AMI 的 root volume 可以直接抽換了

這個功能等了十年以上總算是出現了,Amazon EC2 的 AMI 總算是能直接抽換 root volume,不用先停掉機器:「Amazon EC2 enables easier patching of guest operating system and applications with Replace Root Volume」。

Starting today, Amazon EC2 supports the replacement of instance root volume using an updated AMI without requiring customers to stop their instance. This allows customers to easily update their applications and guest operating system, while retaining the instance store data, networking and IAM configuration.

算是 pre-container 時代會遇到的問題,後來大家都把 workaround 變成 practice 了:每次需要時候都是直接整包重新打包 (像是 Packer 這類的工具),然後用工具更新 AMI id 改開新的機器,這樣就能夠避開需要先停掉現有機器的問題...

怎麼會突然想到要回來支援這個功能 XD

Amazon EC2 的 Trn1 正式開放使用

AWS 自家研發晶片的 trn1.* 上線了:「Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available」。

先前三家雲端的廠商只有 Google Cloud PlatformTPU 可以 train & evaluate,現在 AWS 推出 AWS Trainium,補上 train 這塊的產品。其中官方宣稱可以比 GPU 架構少 50% 的計算成本:

Trainium-based EC2 Trn1 instances solve this challenge by delivering faster time-to-train while offering up to 50% cost-to-train savings over comparable GPU-based instances.

然後 PyTorchTensorFlow 都有支援:

The Neuron plugin natively integrates with popular ML frameworks, such as PyTorch and TensorFlow.

另外用 neuron-ls 可以看到 Neuron 裝置的資訊,不過沒看懂為什麼要 mask 掉 private ip 的資訊:

大型的 cluster 會使用 Amazon FSx for Lustre 整合提供服務:

For large-scale model training, Trn1 instances integrate with Amazon FSx for Lustre high-performance storage and are deployed in EC2 UltraClusters. EC2 UltraClusters are hyperscale clusters interconnected with a non-blocking petabit-scale network.

但第一波開放的區域有點少,只有萬年美東一區 us-east-1 與美西二區 us-west-2

You can launch Trn1 instances today in the AWS US East (N. Virginia) and US West (Oregon) Regions as On-Demand, Reserved, and Spot Instances or as part of a Savings Plan.

us-east-1trn1.2xlarge 的價錢是 US$1.34375/hr,但沒有實際跑過比較好像沒辦法評估到底行不行...

但總算是擺出個產品對打看看,畢竟要夠大才能去訂製這些東西。

AWS 東京區有 12TB 記憶體的機器了

月初 AWS 宣佈東京區有 u-12tb1.112xlarge 可以用了:「Amazon EC2 High Memory instances with 3, 6, 9, and 12TiB of memory are now available in Asia Pacific (Tokyo) region」。查了一下 on-demend 的價錢是 $131.733/hr,如果一個月以 720 小時來算,要 $94847.76/mo...

沒記錯的話,這種機器應該是要另外申請 limit 才能開,沒辦法說測就測。另外在公告裡面有提到 savings plan ,但沒提到 RI (reserved instance),不確定是不是還沒開 RI 讓使用者買 (不過我記得 savings plan 好像也有類似的折扣結構):

Starting today, Amazon EC2 High Memory instances with 3TiB (u-3tb1.56xlarge), 6TiB (u-6tb1.56xlarge, u-6tb1.112xlarge), 9TiB (u-9tb1.112xlarge), and 12TiB of memory (u-12tb1.112xlarge) are available in Asia Pacific (Tokyo) region. Customers can start using these new High Memory instances with On Demand and Savings Plan purchase options.

這種機器是用暴力解決問題的機器...

目前 AWS 台北區只能開 *.2xlarge 的機器

前面在「AWS 的台北區 (Local Zone) 開了」這邊有提到機器開不起來,剛剛查價錢的時候才發現只能開 {c5,g4dn,m5,r5}.2xlarge

改成 c5.2xlarge 然後就開起來了:

翻了目前所有的 local zone,看起來大多都是類似的情況,選擇性會很少... 目前只有邁阿密與洛杉磯的選擇比較多,這是邁阿密:

這是洛杉磯:

這樣目前要拿來當 VPS 取代品還不太好用,就真的是 local zone 的定位。

AWS 將新的 Nitro 架構回過投來支援以前 Xen 的機種

Twitter 上看到 Jeff Barr 提到的這篇,講 AWS 決定讓新的 Nitro 架構支援舊的 Xen 機種:

原文是「Xen-on-Nitro: AWS Nitro for Legacy Instances」這篇,裡面雖然很美化的在講這件事情,但提到了幾個很現實的問題,第一個是仍然有大量使用者 (120 萬) 在用 Xen 架構的機器:

Today, we still have over 1.2 Million unique customers using Xen-based instances.

但這些機器其實愈來愈難維護,一方面是 Nitro 讓 AWS 省下很多軟體上的維護,另外一方面是幾乎不會有新的使用者用這些舊機種,在採購上面也會是問題。

However, the underlying hardware is old and it’s getting increasingly difficult to maintain support for these older hypervisor systems.

所以 EC2 的團隊把 Nitro 的 Xen 相容架構給實做出來,從 2022 年開始就可以全部都用 Nitro 系統,這樣對 EC2 團隊的維護成本就會大幅下降:

All of these innovations enable us to continue to offer many of our older instance types well past the lifetime of the original hardware. Starting in 2022, customers launching M1, M2, M3, C1, C3, R3, I2 and T1 instances will land on Nitro supported instances hardware and existing running instances will also be migrated.

技術債沒辦法消失,就用這種方式降低維護成本耗 XD

AWS 要推出 Graviton3 的機種了

AWS 打算要推出 Graviton3 的機種了,目前還在 preview 階段:「Join the Preview – Amazon EC2 C7g Instances Powered by New AWS Graviton3 Processors」。

目前是宣稱與前一代的 Graviton2 相比有 25% 的效能提昇,另外在浮點數與密碼相關的運算上面也會有改善 (這個效能提昇的數字應該是有指令集的幫助):

In comparison to the Graviton2, the Graviton3 will deliver up to 25% more compute performance and up to twice as much floating point & cryptographic performance. On the machine learning side, Graviton3 includes support for bfloat16 data and will be able to deliver up to 3x better performance.

另外提到了 signed pointer,可以避免 stack 被搞,不過這邊需要 OS 與 compiler 的支援,算是針對 stack 類的攻擊提出的防禦方案:

Graviton3 processors also include a new pointer authentication feature that is designed to improve security. Before return addresses are pushed on to the stack, they are first signed with a secret key and additional context information, including the current value of the stack pointer. When the signed addresses are popped off the stack, they are validated before being used. An exception is raised if the address is not valid, thereby blocking attacks that work by overwriting the stack contents with the address of harmful code. We are working with operating system and compiler developers to add additional support for this feature, so please get in touch if this is of interest to you.

然後是使用 DDR5 的記憶體:

C7g instances will be available in multiple sizes (including bare metal), and are the first in the cloud industry to be equipped with DDR5 memory. In addition to drawing less power, this memory delivers 50% higher bandwidth than the DDR4 memory used in the current generation of EC2 instances.

現在還沒看到價錢,不過有可能是跟 c6g 一樣的價位?但考慮到記憶體換架構,也有可能是貴一些的?

另外翻了一下資料,ARM 有發表過新聞稿提到 Graviton2 是 ARM 的 Cortex-M55 機種:「Designing Arm Cortex-M55 CPU on Arm Neoverse powered AWS Graviton2 Processors」,這次的 Graviton3 應該在之後完整公開後會有更多消息出來...

Amazon RDS 支援 readonly instance 當作 Multi AZ 的機器了

從來沒在用 RDS 的 Multi AZ,所以根本沒注意到居然沒這個功能:「New Multi-AZ deployment option for Amazon RDS for PostgreSQL and for MySQL; increased read capacity, lower and more consistent write transaction latency, and shorter failover time (Preview)」。

看起來 (加上印象中) 之前的 Multi AZ 是另外一台機器先開著但不能用:

In the case of an infrastructure failure, Amazon RDS performs an automatic failover to the standby, so that database operations resume as soon as the failover is complete.

現在則是開著的機器可以跑 readonly 模式:

The standby DB instances act as automatic failover targets and can also serve read traffic to increase throughput without needing to attach additional read replica DB instances.

這樣做除了省成本外,另外因為這些 instance 平常就有 query 的量,當真的遇到 failover 切換時,warmup 的時間也會短很多 (尤其是服務夠大的時候)。

不過有些限制,首先看起來只支援 Graviton2 (ARM-based) 的機種?

The readable standby option for Amazon RDS Multi-AZ deployments works with AWS Graviton2 R6gd and M6gd DB instances (with NVMe-based SSD instance storage) and Provisioned IOPS Database Storage.

然後是支援的區域:

The Preview is available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) regions.

以及夠新的版本,MySQL 8 與 PostgreSQL 13.4 才有提供:

Amazon RDS for MySQL supports the Multi-AZ readable standby option for MySQL version 8.0.26. Amazon RDS for PostgreSQL supports the Multi-AZ readable standby option for PostgreSQL version 13.4.

但看起來還不錯,畢竟這比較接近以前在地端機房時的作法...

Auto Scaling 就不能綁個 Lambda 嗎...

看到 AWS 宣佈 Amazon EC2 的 auto scaling 有新花樣:「New – Attribute-Based Instance Type Selection for EC2 Auto Scaling and EC2 Fleet」。

你都有 Lambda 了,就不能整合 Lambda 每分鐘跑一次,讓使用者直接用個 turing complete 的方式自己設計要的 policy 嗎... 會用到 auto scaling 的使用者不會在意 Lambda 的那幾毛錢的。

這樣做對 OKR 是比較好沒錯啦,但用的人已經懶的看了...