A while back, GitHub published an article describing the journey of building its own search engine: "The technology behind GitHub's new code search".

Reading through it, what they did was essentially build their own search engine cluster, then add features specific to code search on top.

This search engine is still in beta: of the 200 million repositories on the whole site it only covers 45 million (about 22%), which already amounts to 115 TB of code. The article also mentions that when Elasticsearch was first deployed, the count was 8 million repositories:
GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.
It currently runs on a 32-machine cluster; the article doesn't specifically mention memory sizes, nor numbers like the replication factor:
Code search runs on 64 core, 32 machine clusters.
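The full article also describes distributing documents across shards by Git blob object ID. As a minimal sketch (not GitHub's actual code, and the function names here are hypothetical), modulo routing over the blob hash would spread identical blobs deterministically and load evenly across 32 shards:

```python
# Sketch: route a document to one of 32 shards by hashing its Git blob
# object ID. Identical blobs always land on the same shard, and the hash
# output is uniform enough to balance load without extra bookkeeping.
import hashlib

NUM_SHARDS = 32  # matches the 32-machine cluster mentioned above

def blob_oid(content: bytes) -> str:
    """Git blob object ID: SHA-1 over a 'blob <len>\\0' header plus content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def shard_for(content: bytes) -> int:
    """Pick a shard from the blob OID via simple modulo routing."""
    return int(blob_oid(content), 16) % NUM_SHARDS

print(shard_for(b'fn main() { println!("hello"); }\n'))
```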
And the various inverted indices plus all the data add up to only 25 TB after compression:
There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!
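The quote names two of the ideas behind that number: dedupe identical content, and build an ngram index over only the unique blobs. A toy sketch of both, assuming trigrams and in-memory sets (illustrative structure only, not GitHub's implementation):

```python
# Toy sketch: dedup blobs by content hash, then build a trigram inverted
# index over unique content only. Real systems store compressed posting
# lists instead of Python sets.
import hashlib
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    return {text[i:i + 3] for i in range(len(text) - 2)}

unique_content: dict[str, str] = {}              # blob hash -> content, stored once
index: defaultdict[str, set] = defaultdict(set)  # trigram -> blob hashes

def ingest(content: str) -> None:
    h = hashlib.sha1(content.encode()).hexdigest()
    if h in unique_content:      # dedup: identical blob is indexed only once
        return
    unique_content[h] = content
    for gram in trigrams(content):
        index[gram].add(h)

def search(query: str) -> list[str]:
    """Candidates must contain every trigram of the query; a final substring
    scan filters out false positives from trigram co-occurrence."""
    grams = trigrams(query)
    if not grams:
        return list(unique_content)
    candidates = set.intersection(*(index[g] for g in grams))
    return [h for h in candidates if query in unique_content[h]]

ingest("fn main() {}")
ingest("fn main() {}")           # duplicate, skipped by dedup
ingest("def main(): pass")
print(search("main("))
```

Per the quote, the 25 TB figure covers both sides of this sketch at scale: all the indices (including the ngrams) plus a compressed copy of every piece of unique content.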
Run the numbers and you realize we're now in an era where "brute force" can solve a lot of problems, and this is already the world's largest code hosting service.

In the past, scale almost any project up a bit and you'd run into Amdahl's law; things are a lot more relaxed now...
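For reference, Amdahl's law bounds the speedup from parallelizing a fraction $p$ of the work across $N$ processors:

```latex
% Even with unlimited processors, speedup is capped by the serial
% fraction (1 - p) of the workload.
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

With $p = 0.95$, for example, the ceiling is $1/(1-0.95) = 20\times$ no matter how many of those 64-core machines you throw at it, which is why "just add hardware" used to stop working so quickly.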