This month's GitHub instability all traces back to the mysql1 cluster...

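GitHub has published an explanation of this month's four downtime incidents, all of which were largely tied to the mysql1 cluster: "An update on recent service disruptions". The post was written by Keith Ballinger; I looked him up, and his title is SVP of Engineering at GitHub.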

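There is some background on the mysql1 cluster mentioned in the post in "Partitioning GitHub’s relational databases to handle scale" (I also covered it in my earlier post on GitHub's MySQL architecture and numbers). In short, GitHub uses two approaches to scale, ProxySQL and Vitess, but it is clear that the main database itself still carries a very heavy load.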

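The problem this time is that mysql1 appears to have hit a performance bottleneck, and the root cause still has not been found. You can see this across the write-ups for each incident, starting with the first outage: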

The incident appeared to be related to peak load combined with poor query performance for specific sets of circumstances.

The second:

The following day, we saw the same peak traffic pattern and load on mysql1. We were not able to pinpoint and address the query performance issues before this peak, and we decided to proactively failover before the issue escalated.

The third:

While we had reduced load seen in the previous incidents, we were not fully confident in the mitigations.

In this third incident, we enabled memory profiling on our database proxy in order to look more closely at the performance characteristics during peak load.

And the most recent one, the fourth:

In order to reduce load, we throttled webhook traffic and will continue to use that as a mitigation to prevent future recurrence during peak load times as we continue to investigate further mitigations.

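So this is basically not over yet. The next time the problem shows up, webhook traffic will probably be the first thing on the chopping block again...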
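GitHub's post does not say how the webhook throttling is actually implemented. Purely as an illustration of what "throttling traffic during peak load" can look like, here is a minimal token bucket sketch. The `TokenBucket` class, its parameters, and the injectable `clock` are all my own hypothetical names, not anything taken from GitHub's systems.

```python
import time


class TokenBucket:
    """Hypothetical rate limiter; not GitHub's actual mechanism."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity) # start with a full bucket
        self.clock = clock            # injectable so it can be tested
        self.last = clock()

    def allow(self, cost=1.0):
        """Return True if a request of the given cost may proceed now."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

In a setup like this, a webhook dispatcher would call `allow()` before each delivery and queue or delay the delivery when it returns `False`. Lowering `rate` during peak hours would reduce the load that reaches the database, at the cost of delayed deliveries.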
