Amazon Prime Video 捨棄 AWS Step Functions 回頭用 EC2 與 ECS 省錢的文章

昨天在 Hacker News 上熱烈討論的文章,是一篇三月就放出來,但昨天被丟上來意外的熱烈討論,在講 Amazon Prime Video 的團隊改寫程式,把 AWS Step Functions 拔掉,並且回頭用 EC2ECS 而省下大量 AWS 費用的文章討論:「Scaling up the Prime Video audio/video monitoring service and reducing costs (primevideotech.com)」,原文在「Scaling up the Prime Video audio/video monitoring service and reducing costs by 90%」,Internet Archive 的備份Archive Today 的備份

先看文章的部分,裡面提到了他們用 AWS Step Functions,但意外的貴:

The initial version of our service consisted of distributed components that were orchestrated by AWS Step Functions. The two most expensive operations in terms of cost were the orchestration workflow and when data passed between distributed components.

然後改寫程式把所有東西都放在單一 process 裡面跑就好,用標準的 EC2 或是 ECS 就可以 scale 很好,而且也省錢:

To address this, we moved all components into a single process to keep the data transfer within the process memory, which also simplified the orchestration logic. Because we compiled all the operations into a single process, we could rely on scalable Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS) instances for the deployment.

可以看出起因是一開始設計時的 overdesign,把可以簡單處理的東西拆開,另外加上雲端在這塊收費特別貴而導致成本爆增... 這件事情偶而會發生,尤其是比較新的東西會沒注意到成本,通常在上線發現不太對的時候就會安排 refactor 掉。

但如果是 Amazon 自家集團的其他團隊出來抱怨,就有很棒的 PR 效果了,所以 Hacker News 上就看到有人在猜可能過不久後文章就會不見 XD (但文章紅了以後應該就不會不見 XD):

My word. I'm sort of gob smacked this article exists.

I know there are nuances in the article, but my first impression was it's saying "we went back to basics and stopped using needless expensive AWS stuff that caused us to completely over architect our application and the results were much better". Which is good lesson, and a good story, but there's a kind of irony it's come from an internal Amazon team. As another poster commented, I wouldn't be surprised if it's taken down at some point.

很政治不正確的文章 XD

以之前的經驗來說,AWS 上類似的東西還包括了 NAT Gateway,這東西只適合在有強資安需求 (像是法規要求),而且需要連外的流量很少的時候適合。

NAT Gateway 在新加坡 ap-southeast-1 要 $0.059/hr (美金,所以大約是 $42.48/mo),以及 US$0.045/GB 的處理費用,所以假設你每天只有 100GB (平均 10Mbps),就等於是 3TB/mo,要 $135/mo。這樣整包就 $172.48/mo 了。

如果讓 EC2 機器直接連去 internet 抓資料的話,這些費用就是 $0,你只要付無論是有 NAT Gateway 或是沒有 NAT Gateway 的 outbound traffic 費用部分 (大多是各種 TCP/TLS/HTTP header)。

比較省成本的解法是用 security group 對 outbound traffic 開放特定的流量來解。

另外一種方式還是 NAT,但是是自己架設 HA 的 NAT service,像是 2015 年的文章「The Right Way to set up NAT in EC2」提到的方法。

這個方法以現在的機種來說,兩台 t4g.nano 的機器加上 EBS 不到 $10/mo,唯一要注意的應該是網路頻寬雖然可以 burst 到 5Gbps,但他的網路頻寬是 credit 機制,當 credit 用完的時候 t4g.nano 記得是剩下 100Mbps 左右?不過真的有這個量的時候機器也可以往上開大一點...

另外還有很多「好用」的雲端服務,但看到帳單後就變得「不好用」的雲端服務... 在用之前先算一下成本就會發現了。

把 AWS 的 Billing 資料接進 Grafana 上...

Twitter 上看到 Grafana 的帳號提到了一篇把 AWS Billing 資料接進 Grafana 上的文章:

這個想法還頗不賴的,有些東西剛好可以交叉比對拉出來...

Mastercard 對實體物品提供免費試用後的訂閱條款

Mastercard 規定在免費試用後 (實體物品),需要另外再讓使用者再同意一次才能開始收訂閱費用:「Free Trials Without The Hassle」。

The rule change will require merchants to gain cardholder approval at the conclusion of the trial before they start billing. To help cardholders with that decision, merchants will be required to send the cardholder – either by email or text – the transaction amount, payment date, merchant name along with explicit instructions on how to cancel a trial.

新聞一開始出來時其實讓蠻多人關注的,因為一堆網路服務都是靠這招... 所以 Mastercard 在文章後更新說明,目前只有實體物品套用這個規則:

*This blog was updated on January 17, 2019 to clarify that the rule change is applicable to physical products such as skincare, healthcare items etc.

AWS 讓你可以禁止 RI 跨帳號計算了...

現在 AWS 讓你可以設定,是否允許 Reserved Instance (RI) 跨帳號使用:「Customize your organization’s AWS credit and Reserved Instance (RI) discount sharing using new billing preferences」。

以往是優先用在自己帳號,但如果有剩的話可以挪去其他帳號用。這樣雖然比較省錢,但有時候會造成帳務的「困擾」:

Historically, AWS has maximized customer savings by applying credits and RI discounts first to the account that owned the credit or RI lease and then distributing the remainder, if any, to qualifying usage incurred by accounts in the same organization. While this approach had the potential of lowering the overall bill, customers were unable to control if, and how, discounts were applied across organizational lines.

現在則是可以關掉:

To provide greater flexibility, customers can now disable AWS credit sharing across all accounts in their organization. This ensures that only the account that owns a credit, or has previously redeemed a credit, receives the associated benefit.

也可以分開設定:

You can also designate a set of accounts for which RI discount sharing is disabled, while continuing to share RI discounts among the rest of the accounts in your organization.

這樣雖然會比較貴,但這其實反應到某些組織文化上的問題啦...

Amazon EC2 的 CRI 支援一年版本了...

Amazon EC2 的 CRI (Convertible Reserved Instance) 支援一年的合約了:「EC2 Convertible Reserved Instance Update – New 1-Year CRI, Merges & Splits」,這樣彈性再多了一些:

Today we are introducing Convertible RIs with a 1-year term, complementing the existing 3-year term.

不過 CRI 主要是用在需要換 family type 的情境下,如果是已知 family type (像是一般性的 worker 會選 C4 或是剛推出的 C5) 那麼就直接選擇 Regional RI 就好...

基本上就是讓財務操作上多個選擇 :o

EC2 與 EBS 十月開始以秒計費

雖然只是 Amazon EC2Amazon EBS 計價模式的改變,但這次 AWS 的改變對於許多開發流程有很大的影響 (重點在 EC2 的部份):「New – Per-Second Billing for EC2 Instances and EBS Volumes」。

10/2 開始改變 (而不是 10/1),低消一分鐘,Windows 機種以及需要額外收費的 Linux 機種不在範圍內:

This change is effective in all AWS Regions and will be effective October 2, for all Linux instances that are newly launched or already running. Per-second billing is not currently applicable to instances running Microsoft Windows or Linux distributions that have a separate hourly charge. There is a 1 minute minimum charge per-instance.

然後 Spot 與買 RI 後也是一樣以秒計價:

List prices and Spot Market prices are still listed on a per-hour basis, but bills are calculated down to the second, as is Reserved Instance usage (you can launch, use, and terminate multiple instances within an hour and get the Reserved Instance Benefit for all of the instances).

這次改變的影響很巨大。馬上可以想到幾個情境...

第一個是對於實踐 Release early, release often 的團隊來說,如果設計成每 deploy 一次就建一個新的 AMI (最乾淨的作法),再開新機器換掉的話,成本就會增加不少。所以對於這樣的團隊,就會偏好朝著替換現有目錄內的東西後重啟...

現在改成以秒計費後,直接透過 Blue-Green Deployment 就可以了 (AWS CodeDeploy 年初也支援了:「AWS CodeDeploy 支援 BlueGreenDeployment」):(如果不熟悉 Blue-Green Deployment 的話,更白話的說法就是「先建後拆」...)

同樣的理由,對於 Auto Scaling 的 policy 也有些改變。之前機器開起來都會想讓他跑一個小時,所以 scale down 的部份都會寫的比較鬆一點。現在就可以重新規劃了...

另外一個影響是對使用 container 的誘因少了不少。很多人用 container 的用法是開大台機器再裡面拆給不同服務用,讓資源利用率變高,現在變成用多少算多少後就不太需要這樣了...

當然也還是有缺點。以前 Spot Instance 如果被 AWS 收回時,最後的那個小時是不計費的。現在因為以秒計費,變成要收費了...

最後是 10/2 生效這件事情頗怪,該不會是財務部門不願意配合 10/1 星期天加班生效,所以只好變成 10/2 生效這種理由吧... XDDD

Netflix 把金流相關的系統轉移到 AWS 上跑 MySQL 的故事...

這次要提的是「Netflix Billing Migration to AWS」、「Netflix Billing Migration to AWS - Part II」與「Netflix Billing Migration to AWS - Part III」這三篇。

Netflix 先前的金流相關系統跑的是 Oracle 的資料庫:

然後換成 MySQL

系統上是採用 DRBD,然後底層是 5 個 4TB 的 EBS 組成的 RAID 0,跑 LVM

High performance with respect to reads and writes was achieved by using RAID0 with EBS provisioned IOPS volumes. To get more throughput per volume, 5 volumes of 4TB each were used, instead of 1 big volume. This was to facilitate faster snapshots and restores.

LVM to manage two Logical Volume’s (DB and DRBD Metadata) within single Volume Group.

可以看到裡面用的都是很經過時間考驗的技術,像是 DRBD、標準的 Replication 架構...

用 Pushover 當簡訊...

很久之前被 ccn 介紹 Pushover,可以很簡單的透過 API 送推播,這樣就可以用來代替簡訊發給自己。

第一次申請有七天的試用期可以用,試用期滿後每個 device 的費用是一次性的 USD$4.99,在 iOS 裝置上可以透過 IAP (Apple) 購買,Android 裝置則是透過 IAB 購買。

官網上可以看到 API 設計很簡單,user token + application token 用 POST 帶進去就可以發出去了。

就算不透過 API 寫,也可以透過 IFTTT 串接起來,像是我設定中文維基百科上的條目「Kalafina」,有修改就通知我:

AWS Price List API

AWS 把價錢資訊也 API 化了:「New – AWS Price List API」。

除了可以透過 API 取得資訊外,還可以透過 Amazon SNS,在價錢有變動時得到通知:

You can also elect to receive notification via Amazon Simple Notification Service (SNS) each time we make a price change.

讓 billing 的各種計算變方便。