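At the beginning of the month GitHub logged everyone out, and a few days ago they published a post explaining what happened: "How we found and fixed a rare race condition in our session handling".
It started early in the month, when a user reported that after logging in they were suddenly authenticated as someone else: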
On March 2, 2021, we received a report via our support team from a user who, while using GitHub.com logged in as their own user, was suddenly authenticated as another user. They immediately logged out, but reported the issue to us, as it rightfully worried them.
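The rest of the post explains that, while improving github.com's performance, they layered a lot of threading on top of the Rails architecture but did not handle critical sections and object reuse correctly, which caused the problem.
Once you know the root cause is a thread-safety issue, the fix is more or less clear. What makes the post worth reading is that GitHub reveals quite a few interesting technical details along the way.
First, github.com retains HTTP headers and HTTP bodies, and also records which machine and which process handled each request. That is very helpful when tracking down a problem after the fact: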
From reviewing logs, we could gather that the HTTP body in the response to the client we sent was correct and only the cookies in the response to the user were wrong. The affected users from the support reports received a session cookie from a user who very recently had a request handled inside the same process. In one case, the two requests were handled sequentially, one after the other. In the second case, there were two other requests in between.
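To make the value of those fields concrete, here is a minimal Python sketch of how per-request logs that carry a host and process ID could be used to pair a response carrying the wrong cookie with an earlier request in the same process. The `RequestLog` fields and the `suspicious_pairs` helper are hypothetical names made up for this illustration; this is not GitHub's log format or tooling.

```python
import os
import socket
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    """Hypothetical log record; the field names are assumptions for illustration."""
    request_id: str
    user: str          # user who made the request
    cookie_user: str   # user whose session cookie was sent in the response
    host: str = field(default_factory=socket.gethostname)
    pid: int = field(default_factory=os.getpid)


def suspicious_pairs(logs):
    """Pair each mismatched response with the most recent earlier request
    from the cookie's owner that was handled in the same (host, pid)."""
    pairs = []
    seen = {}  # (host, pid) -> earlier records handled by that process
    for log in logs:
        earlier = seen.setdefault((log.host, log.pid), [])
        if log.cookie_user != log.user:
            source = next(
                (e for e in reversed(earlier) if e.user == log.cookie_user),
                None,
            )
            if source is not None:
                pairs.append((log.request_id, source.request_id))
        earlier.append(log)
    return pairs


logs = [
    RequestLog("r1", "alice", "alice", "web-1", 100),
    RequestLog("r2", "bob", "alice", "web-1", 100),   # bob received alice's cookie
    RequestLog("r3", "carol", "carol", "web-2", 200),
]
print(suspicious_pairs(logs))
```

Without the host and process ID, `r2` could not be tied back to `r1`, which is the kind of correlation the quoted passage describes.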
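I am not sure whether every HTTP request is logged this way. At GitHub's volume that would be a considerable amount of data, but modern hardware can probably just brute-force it...
The other point is that github.com introduced threading to improve performance, though I am not sure whether this part is written in C/C++ or simply uses the threading that Ruby itself provides: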
Threads were already used in other places in this application, but the new background thread produced a novel and unforeseen interaction with our exception handling routines. When exceptions were reported from a background thread, such as a query timeout, the error log would contain information from both the background thread and the currently running request, showing that the data was being pulled across threads.
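The general failure mode (a background thread that outlives its request while the per-request object is reused) can be reproduced in a few lines. The following is a minimal Python sketch, not GitHub's Ruby code: `Worker`, `handle_request`, and `simulate` are hypothetical names, and it assumes a worker process that recycles a single request object between requests.

```python
import threading


class Worker:
    """Hypothetical worker process that may reuse one env dict across requests."""

    def __init__(self, reuse_env):
        self.reuse_env = reuse_env
        self._env = {}

    def new_env(self, user):
        if self.reuse_env:
            env = self._env   # same object as the previous request
            env.clear()
        else:
            env = {}          # fresh object for every request
        env["user"] = user
        return env


def handle_request(worker, user, background_gate=None):
    env = worker.new_env(user)
    bg = None
    if background_gate is not None:
        def late_write():
            # Runs after the request has finished, but still holds `env`.
            background_gate.wait()
            env["set_cookie"] = f"session-of-{user}"

        bg = threading.Thread(target=late_write)
        bg.start()
    return env, bg


def simulate(reuse_env):
    worker = Worker(reuse_env)
    gate = threading.Event()
    _, bg = handle_request(worker, "alice", background_gate=gate)
    env_bob, _ = handle_request(worker, "bob")  # next request, same process
    gate.set()   # alice's background thread now performs its late write
    bg.join()
    return env_bob.get("set_cookie")


print(simulate(reuse_env=True))
print(simulate(reuse_env=False))
```

With reuse enabled, the late write from alice's background thread lands in the object that now belongs to bob's request. With a fresh object per request, bob's response is unaffected. The `threading.Event` only makes the interleaving deterministic for the demonstration; in a real service the timing would be what makes the bug rare.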
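This kind of optimization only pays off on a service that is large enough. All I can say is that the GitHub engineers had little choice here: bolting threading onto an application that is already very complex is indeed an easy way to get bitten...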