跨雲端的 Zero Downtime 轉移

看到「Ask HN: Have you ever switched cloud?」這個討論,在講雲端之間的搬遷,其中 vidarh 的回答可以翻一下...

首先是他提到原因的部份,基本上都是因為錢的關係,從雲搬到另外一個雲,然後再搬到 Dedicated Hosting 上:

Yes. I once did zero downtime migration first from AWS to Google, then from Google to Hetzner for a client. Mostly for cost reasons: they had a lot of free credits, and moved to Hetzner when they ran out.

Their savings from using the credits were at least 20x what the migrations cost.

然後他也直接把整理的資料丟出來,首先是在兩端上都先建立 load balancer 類的服務:

* Set up haproxy, nginx or similar as reverse proxy and carefully decide if you can handle retries on failed queries. If you want true zero-downtime migration there's a challenge here in making sure you have a setup that lets you add and remove backends transparently. There are many ways of doing this of various complexity. I've tended to favour using dynamic dns updates for this; in this specific instance we used Hashicorp's Consul to keep dns updated w/services. I've also used ngx_mruby for instances where I needed more complex backend selection (allows writing Ruby code to execute within nginx)

再來是打通內網,其實就是 site-to-site VPN:

* Set up a VPN (or more depending on your networking setup) between the locations so that the reverse proxy can reach backends in both/all locations, and so that the backends can reach databases both places.

然後建立資料庫的 replication server 以及相關的機制:

* Replicate the database to the new location.

* Ensure your app has a mechanism for determining which database to use as the master. Just as for the reverse proxy we used Consul to select. All backends would switch on promoting a replica to master.

* Ensure you have a fast method to promote a database replica to a master. You don't want to be in a situation of having to fiddle with this. We had fully automated scripts to do the failover.

然後是確認 application 端可以切換自如:

* Ensure your app gracefully handles database failure of whatever it thinks the current master is. This is the trickiest bit in some cases, as you either need to make sure updates are idempotent, or you need to make sure updates during the switchover either reliably fail or reliably succeed. In the case I mentioned we were able to safely retry requests, but in many cases it'll be safer to just punt on true zero downtime migration assuming your setup can handle promotion of the new master fast enough (in our case the promotion of the new Postgres master took literally a couple of seconds, during which any failing updates would just translate to some page loads being slow as they retried, but if we hadn't been able to retry it'd have meant a few seconds downtime).

然後確認新的雲端有足夠的 capacity 撐住流量後,就是要轉移了,首先是降低 DNS TTL:

Once you have the new environment running and capable of handling requests (but using the database in the old environment):

* Reduce DNS record TTL.

然後把舊的 load balancer 指到新的後端,這時候如果發現問題可以快速 rollback 回來:

* Ensure the new backends are added to the reverse proxy. You should start seeing requests flow through the new backends and can verify error rates aren't increasing. This should be quick to undo if you see errors.

接著把 DNS 指到新的 load balancer,理論上應該不會有太大問題:

* Update DNS to add the new environment reverse proxy. You should start seeing requests hit the new reverse proxy, and some of it should flow through the new backends. Wait to see if any issues.

接著把資料庫切到新的機房,有問題時可以趕快切回去再確認哪邊有狀況:

* Promote the replica in the new location to master and verify everything still works. Ensure whatever replication you need from the new master works. You should now see all database requests hitting the new master.

最後的階段就是拔掉舊的架構:

* Drain connections from the old backends (remove them from the pool, but leave them running until they're not handling any requests). You should now have all traffic past the reverse proxy going via the new environment.

* Update DNS to remove the old environment reverse proxy. Wait for all traffic to stop hitting the old reverse proxy.

* When you're confident everything is fine, you can disable the old environment and bring DNS TTL back up.

其實這個方法跟雲端沒什麼關係,以前搞機房搬遷的時候應該都會規劃過類似的方案,大方向也都類似 (把 stateful services 與 stateless services 拆開來分析),只是不像雲端的彈性租賃,硬體要準備比較多...

我記得當年 Instagram 搬進 Facebook 機房的時候也有類似的計畫,之前有提過:「Instagram 從 AWS 搬到 Facebook 機房」。

台灣最近的話,好像是 PChome 24h 有把機房搬到 GCP 上面?看看他們之後會不會到 GCP 的場子上發表他們搬遷的過程...

Leave a Reply

Your email address will not be published. Required fields are marked *