Braintree (PayPal) 用 PostgreSQL 的方式

RDBMS 最困難的事情都圍繞在「怎麼不中斷服務」(很多事情在不用考慮 uptime/downtime 的前提下很好做,不論是 ALTER 或是 failover,到備份還原計畫),而 PayPalBraintree 在「PostgreSQL at Scale: Database Schema Changes Without Downtime」這邊討論修改 PostgreSQL 的 database schema 時怎麼不中斷服務。

文章內的大部份都是給 DBA 知道的細節 (e.g. 怎麼樣才不會觸發大規模的 lock 導致服務中斷),而不是開發者面向的事情... 但開頭的部份,也是我認為最重要的部份,則是需要 Developer 參與的:

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

為了符合這兩個要素,可能會在 schema 設計上有好幾個階段的操作,而非一次到位。而且也才能避免要關站從 backup 倒資料回來的情況...

建議可以研究看看要怎麼玩,常見的情境知道怎麼設計步驟後,真的遇到的時候會比較熟練。

AWS Lambda 也提供 SLA 了

在「AWS Lambda announces service level agreement」這邊看到 AWS Lambda 提供 99.95% 的 SLA :

We have published a service level agreement (SLA) for AWS Lambda. We will use commercially reasonable efforts to make Lambda available with a Monthly Uptime Percentage for each AWS region, during any monthly billing cycle, of at least 99.95% (the “Service Commitment”).

不過這種東西都是宣示意味比較重 (至少表示 AWS 認為產品穩定度夠上 SLA),倒不是希望會用到...

Amazon ECS 與 AWS Fargate 都納入 Amazon Compute SLA 計算

AWS 宣佈的這兩個服務 (Amazon ECSAWS Fargate) 都納入 99.99% 的 SLA 合約範圍:「Amazon Compute Service Level Agreement Extended to Amazon ECS and AWS Fargate」。

Amazon Elastic Container Service (Amazon ECS) and AWS Fargate are now included in the Amazon Compute Service Level Agreement (SLA) for 99.99% uptime and availability.

ECS 已經跑一陣子了可以理解,但 Fargate 的概念算比較新,剛出來沒多久就決定放進去比較意外...

AWS 主動提高 Amazon EC2 與 Amazon EBS 的 SLA

AWS 主動提高 Amazon EC2Amazon EBSSLA:「Announcing an increased monthly service commitment for Amazon EC2」。

Amazon EC2 is announcing an increase to the monthly service commitment in the EC2 Service Level Agreement (“SLA”), for both EC2 and EBS, to 99.99%. This increased commitment is the result of continuous investment in our infrastructure and quality of service. This change is effective immediately in all regions, and is available to all EC2 customers.

之前是 99.95% monthly (參考前幾天的頁面:「Amazon EC2 SLA」),現在拉到 99.99% 了。第一階的賠償條件也從 99.95%~99% 改成 99.99%~99% 了 (賠 10%)。

Amazon 詳細說明 AWS EBS/RDS 故障的原因...

Amazon 在網頁上說明了在 4/21 美東地區 EBS/RDS 故障的原因:「Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region」。

整篇文章相當長,但整個連鎖反應是起於操作時的人為疏失。而 RDS 因為是基於 EBS 上,所以也跟著大爆炸...

另外文章最後面也提到賠償的方式:

For customers with an attached EBS volume or a running RDS database instance in the affected Availability Zone in the US East Region at the time of the disruption, regardless of whether their resources and application were impacted or not, we are going to provide a 10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone.

另外,AWS 的 SLA 中有要求當未滿 SLA 條件時,必須由使用者提出退款要求,但這次則是例外,符合條件的會主動在下次的帳單上,或是系統的頁面上看到:

These customers will not have to do anything in order to receive this credit, as it will be automatically applied to their next AWS bill. Customers can see whether they qualify for the service credit by logging into their AWS Account Activity page.

Google Apps 的 Uptime 計算方式調整 (對使用者有利的方向)

Google 改變了 Google Apps 對 Uptime 的計算方式:「Google Apps Removes Scheduled Downtime Clause From SLA; Gmail Had 99% Uptime in 2010」(TC)、「Destination: Dial Tone -- Getting Google Apps to 99.99%」(Google 官方新聞稿)。

之前的條款內允許 Google 將「排定的維護停機」排除在 downtime 計算外,現在 Google 決定拿掉這條例外條款。然後在文章下面故意提到 Microsoft 的服務:

Comparable data for Microsoft BPOS® is unavailable, though their service notifications show 113 incidents in 2010: 74 unplanned outages, and 33 days with planned downtime.

有興趣看新版 SLA 的人可以連結到英文版的頁面上:「Google Apps Service Level Agreement」。