
The case of 1Password adopting Terraform...

I came across the post "Terraforming 1Password" and this tweet, which say that 1Password's move to Terraform will take the service down for a few hours.

This is basically negative publicity XDDD

Both for 1Password's technical ability and for Terraform's flexibility...

Google's Cloud Spanner

Google has started selling Cloud Spanner as a service: "Introducing Cloud Spanner: a global database service for mission-critical applications", along with the explanatory post "Inside Cloud Spanner and the CAP Theorem".

Cloud Spanner is designed to offer the capabilities of an RDBMS (such as ACID properties) together with strong scalability and availability:

Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.

The explanatory post says that Cloud Spanner is a CP system in terms of the CAP theorem:

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

It then pushes A high enough that users do not care about the downtime:

However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages.

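The more controversial part, of course, is the tweet from the official Google Cloud account on Twitter, which flatly claims that all three CAP conditions are satisfied at once.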


AWS CodeDeploy supports BlueGreenDeployment

AWS CodeDeploy has launched BlueGreenDeployment support: "AWS CodeDeploy Introduces Blue/Green Deployments".

BlueGreenDeployment is a way to push release downtime as low as possible regardless of cost, and to keep rollback time as short as possible when something goes wrong:

One of the challenges with automating deployment is the cut-over itself, taking software from the final stage of testing to live production. You usually need to do this quickly in order to minimize downtime.

Blue-green deployment also gives you a rapid way to rollback - if anything goes wrong you switch the router back to your blue environment.

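In practice it just means running two environments (hence the higher cost), one on the old version and one on the new, and switching between them at the load balancer in front: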

The blue-green deployment approach does this by ensuring you have two production environments, as identical as possible.
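
The cut-over and rollback described above can be sketched in a few lines. This is only a toy model, not how AWS CodeDeploy does it: `Environment`, `Router`, and the version strings `v1` and `v2` are hypothetical names made up for illustration, and `Router` stands in for the load balancer.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    name: str
    version: str


class Router:
    """Hypothetical stand-in for the load balancer in front of both environments."""

    def __init__(self, blue: Environment, green: Environment) -> None:
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # assumption: blue serves production traffic at the start

    def idle(self) -> str:
        # The environment that is not receiving traffic right now.
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str) -> None:
        # The new release goes to the idle environment; live traffic is untouched.
        self.environments[self.idle()].version = version

    def switch(self) -> None:
        # Cut-over and rollback are the same operation: flip the live pointer.
        self.live = self.idle()

    def serving_version(self) -> str:
        return self.environments[self.live].version


router = Router(Environment("blue", "v1"), Environment("green", "v1"))
router.deploy("v2")                      # green now runs v2, blue still serves v1
before = router.serving_version()
router.switch()                          # cut over to green
after = router.serving_version()
router.switch()                          # something went wrong: back to blue
rolled_back = router.serving_version()
print(before, after, rolled_back)
```

Because the old environment keeps running after the switch, rolling back costs one more pointer flip rather than a redeploy, which is where the extra cost of the second environment pays off.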


CNN's article "Netflix goes down. Twitter blows up" covers Netflix being down for several hours yesterday. It mentions a service called Down Detector:

Downdetector -- which monitors outage complaints online -- reported more than 13,000 posts from users all over the world Saturday afternoon.

According to the Down Detector site, part of its data is mined from social networks:

Downdetector collects status reports from a series of sources. Through a realtime analysis of this data, our system is able to automatically determine outages and service interruptions at a very early stage. One of the sources that we analyse are reports on Twitter.

It can even work out whether an outage is global or regional...
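
Down Detector does not publish how it does this, so the following is only a guess at the general idea: count complaint reports per region and see how many regions cross a threshold. The function name `classify_outage`, the region labels, and the threshold value are all assumptions for illustration.

```python
from collections import Counter


def classify_outage(reports, known_regions, per_region_threshold=3):
    """Classify an outage from a list of region labels, one per complaint.

    Assumption: a region counts as affected once it has at least
    per_region_threshold reports. Real systems would compare against
    a baseline rather than a fixed number.
    """
    counts = Counter(reports)
    affected = sorted(r for r in known_regions if counts[r] >= per_region_threshold)
    if not affected:
        return "no outage", affected
    if len(affected) == len(known_regions):
        return "global", affected
    return "regional", affected


regions = ["asia", "eu", "us"]
regional_case = classify_outage(["us"] * 5 + ["eu"], regions)
global_case = classify_outage(["us"] * 5 + ["eu"] * 4 + ["asia"] * 3, regions)
print(regional_case)
print(global_case)
```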


Ironically, the UPS is the most common cause of data center outages: "Survey: UPS Issues Are Top Cause of Outages".

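This is a survey of data centers in the US. The sample is 453 cases drawn from "problems attributed to the data center", and the causes include: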

  • UPS battery failure (65 percent)
  • Exceeding UPS capacity (53 percent)
  • Accidental emergency power off (EPO)/human error (51 percent)
  • UPS equipment failure (49 percent)

Presumably respondents could pick more than one cause? Otherwise the total would exceed 100%.
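
A quick check of the arithmetic, using only the four figures listed above (the variable names are just for illustration):

```python
# Percentages as listed in the survey summary above.
causes = {
    "UPS battery failure": 65,
    "Exceeding UPS capacity": 53,
    "Accidental EPO / human error": 51,
    "UPS equipment failure": 49,
}

total = sum(causes.values())
exceeds_100 = total > 100
print(total, exceeds_100)  # 218 True
```

The four causes alone add up to 218%, which only makes sense if each case could be attributed to more than one cause.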