Home » Posts tagged "downtime"

Braintree (PayPal) 用 PostgreSQL 的方式

RDBMS 最困難的事情都圍繞在「怎麼不中斷服務」(很多事情在不用考慮 uptime/downtime 的前提下很好做,不論是 ALTER 或是 failover,到備份還原計畫),而 PayPalBraintree 在「PostgreSQL at Scale: Database Schema Changes Without Downtime」這邊討論修改 PostgreSQL 的 database schema 時怎麼不中斷服務。

文章內的大部份都是給 DBA 知道的細節 (e.g. 怎麼樣才不會觸發大規模的 lock 導致服務中斷),而不是開發者面向的事情... 但開頭的部份,也是我認為最重要的部份,則是需要 Developer 參與的:

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

為了符合這兩個要素,可能會在 schema 設計上有好幾個階段的操作,而非一次到位。而且也才能避免要關站從 backup 倒資料回來的情況...

建議可以研究看看要怎麼玩,常見的情境知道怎麼設計步驟後,真的遇到的時候會比較熟練。

1Password 使用 Terraform 的案例...

看到「Terraforming 1Password」這篇以及這則 Tweet,講 1Password 導入 Terraform 將會中斷服務幾個小時:

這根本是負面宣傳 XDDD

無論是對 1Password 的技術能力,或是 Terraform 的彈性來說...

Google 的 Cloud Spanner

GoogleCloud Spanner 這個服務拿出來賣了:「Introducing Cloud Spanner: a global database service for mission-critical applications」,以及說明的「Inside Cloud Spanner and the CAP Theorem」。

Cloud Spanner 的規劃上是希望有 RDBMS 的能力 (像是 ACID 特性),又有強大的擴充能力 (scalability) 與可用性 (availability):

Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.

在說明裡有提到 Cloud Spanner 是做到 CAP theorem 裡面的 CP:

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

然後把 A 拉高到使用者不會在意 downtime 的程度:

However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages.

當然,比較讓人爭議的是 Twitter 上 Google Cloud 官方帳號的 tweet,直接講同時解決了 CAP 三個條件:

價錢不算便宜,不過對於想要找方案的人至少有選擇...

AWS CodeDeploy 支援 BlueGreenDeployment

AWS CodeDeploy 推出了 BlueGreenDeployment 的功能:「AWS CodeDeploy Introduces Blue/Green Deployments」。

BlueGreenDeployment 的目的不計成本想辦法把上線的 downtime 壓到最低,而且當出問題時 rollback 的時間壓到最低的方法:

One of the challenges with automating deployment is the cut-over itself, taking software from the final stage of testing to live production. You usually need to do this quickly in order to minimize downtime.

Blue-green deployment also gives you a rapid way to rollback - if anything goes wrong you switch the router back to your blue environment.

其實就是直接跑兩個環境 (所以成本比較高),一套跑舊的一套跑新的,然後在前面的 load balancer 切換:

The blue-green deployment approach does this by ensuring you have two production environments, as identical as possible.

服務當掉後的偵測

CNN 這篇「Netflix goes down. Twitter blows up」提到了昨天 Netflix 當了好幾個小時的情況。裡面提到了 Down Detector 這個服務:

Downdetector -- which monitors outage complaints online -- reported more than 13,000 posts from users all over the world Saturday afternoon.

到 Down Detector 網站上看,這個服務有一部份是從 social network 上挖資料:

Downdetector collects status reports from a series of sources. Through a realtime analysis of this data, our system is able to automatically determine outages and service interruptions at a very early stage. One of the sources that we analyse are reports on Twitter.

甚至還可以挖出是全域性的還是區域性的 outage...

機房斷線最常見的肇因:UPS

UPS 反而是機房斷線最常見的肇因:「Survey: UPS Issues Are Top Cause of Outages」。

這是美國機房的調查,取樣則是從「歸咎於機房的問題」中的 453 件分析,原因包括了:

  • UPS battery failure (65 percent)
  • Exceeding UPS capacity (53 percent)
  • Accidental emergency power off (EPO)/human error (51 percent)
  • UPS equipment failure (49 percent)

應該是多選吧?不然超過 100% 了?

Archives