github – Page 11 – Gea-Suan Lin's BLOG

GitHub Repositions the Role of Redis...

GitHub Engineering explains why they changed the way they use Redis: "Moving persistent data out of Redis".

Inside GitHub, Redis was used in two different scenarios. The first, called transient Redis, served only as a cache:

We used it as an LRU cache to conveniently store the results of expensive computations over data originally persisted in Git repositories or MySQL. We call this transient Redis.

The second had the persistence feature turned on and was called persistent Redis:

We also enabled persistence, which gave us durability guarantees over data that was not stored anywhere else. We used it to store a wide range of values: from sparse data with high read/write ratios, like configuration settings, counters, or quality metrics, to very dynamic information powering core features like spam analysis. We call this persistent Redis.

The post is about persistent Redis being replaced with storage in MySQL (InnoDB):

Recently we made the decision to disable persistence in Redis and stop using it as a source of truth for our data. The main motivations behind this choice were to:

Reduce the operational cost of our persistence infrastructure by removing some of its complexity.

Take advantage of our expertise operating MySQL.

Gain some extra performance, by eliminating the I/O latency during the process of writing big changes on the server state to disk.

For the majority of callsites, we replaced persistent Redis with GitHub::KV, a MySQL key/value store of our own built atop InnoDB, with features like key expiration. We were able to use GitHub::KV almost identically as we used Redis: from trending repositories and users for the explore page, to rate limiting to spammy user detection.
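
To make the idea of "a key/value store built atop a SQL table, with key expiration" concrete, here is a minimal sketch. It is not GitHub::KV: SQLite stands in for MySQL/InnoDB so that the example runs anywhere, and the table name, column names, and method names are all hypothetical.

```python
import sqlite3
import time


class KVStore:
    """Toy key/value store with key expiration on top of a SQL table.

    Assumption: SQLite stands in for MySQL/InnoDB, and the schema and
    method names are made up for illustration (not GitHub::KV's API).
    """

    def __init__(self, conn, clock=time.time):
        self.conn = conn
        self.clock = clock  # injectable clock, so expiry can be tested
        conn.execute(
            "CREATE TABLE IF NOT EXISTS key_values ("
            " item_key TEXT PRIMARY KEY,"
            " item_value TEXT NOT NULL,"
            " expires_at REAL)"  # NULL means the key never expires
        )

    def set(self, key, value, ttl=None):
        """Store a value; ttl is an optional lifetime in seconds."""
        expires_at = None if ttl is None else self.clock() + ttl
        with self.conn:  # commit on success, roll back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO key_values"
                " (item_key, item_value, expires_at) VALUES (?, ?, ?)",
                (key, value, expires_at),
            )

    def get(self, key):
        """Return the value, or None if the key is missing or expired."""
        row = self.conn.execute(
            "SELECT item_value FROM key_values"
            " WHERE item_key = ?"
            " AND (expires_at IS NULL OR expires_at > ?)",
            (key, self.clock()),
        ).fetchone()
        return None if row is None else row[0]

    def purge_expired(self):
        """Physically delete expired rows; returns how many were removed."""
        with self.conn:
            cur = self.conn.execute(
                "DELETE FROM key_values"
                " WHERE expires_at IS NOT NULL AND expires_at <= ?",
                (self.clock(),),
            )
        return cur.rowcount
```

In this sketch, reads filter out expired keys themselves, so a periodic `purge_expired()` only reclaims space and is not needed for correctness. That is a design choice of the example, not something the GitHub post states.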

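The rest of the post covers much of the migration process (including rewriting certain features), but it does not clearly explain why they chose not to keep using Redis.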

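For now we can only go by the three points they mention. Perhaps the I/O cost of persistence was too high, and it was hard to squeeze out any more performance? And conversely, since they had already invested a lot of effort in InnoDB, using it directly would solve the problem?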

Still, it is clear the migration took considerable effort: some applications used Redis in ways that could not be moved to InnoDB directly, and rewriting them took time...

A Record of How the Unix Source Code Evolved

The "dspinellis/unix-history-repo" project on GitHub contains a record of how the Unix source code evolved from 1970 to 2016:

The history and evolution of the Unix operating system is made available as a revision management repository, covering the period from its inception in 1970 as a 2.5 thousand line kernel and 26 commands, to 2016 as a widely-used 27 million line system.

The main purpose is to let researchers analyze the history directly and avoid duplicating work:

The project aims to put in the repository as much metadata as possible, allowing the automated analysis of Unix history.

The later branches are mainly based on FreeBSD (this is also visible in the list of branches):

It has been created by synthesizing with custom software 24 snapshots of systems developed at Bell Labs, the University of California at Berkeley, and the 386BSD team, two legacy repositories, and the modern repository of the open source FreeBSD system.

The whole repository is quite a sight XD

GitHub Now Supports Rebase When Merging Pull Requests

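Some people prefer to keep the history as it was, while others prefer to keep the tree as clean as possible. The newly released rebase option covers the needs of the latter group: "Rebase and merge pull requests".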

GitHub Adds Many More Features...

GitHub rolled out quite a few features last week: "A whole new GitHub Universe: announcing new tools, forums, and features".

There are many new features, but the standout is Projects, whose interface looks somewhat like Trello.

In use, a single repository can have many projects, and cards can be moved around inside them:

With Projects, you can manage work directly from your GitHub repositories. Create cards from Pull Requests, Issues or Notes and organize them into custom columns, whether it’s "In-progress", "Done", "Never going to happen" or any other framework your team uses. Drag and drop the cards inside a column to prioritize them or move them from one column to another as your work progresses.

Hmm... this makes things a lot more usable :o

Simulating Running Motion in a Game

The post "Balls to learning how to animate, let's film some parkour!" brought back a nostalgic game, the 1989 Prince of Persia.

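Jordan Mechner (the creator of Prince of Persia) used rotoscoping: he filmed his brother performing these movements, then used the footage to make sure the character's motion looked smooth on the computer: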

Here's the source frames used to rotoscope the above animation. Don't let the ghostly pallor fool you! Jordan Mechner's brother is in fact quite healthy; he was altered with state-of-the-art Liquid Paper and Sharpie technology to fit the palette restrictions of the Apple II.

In 2012 the original Prince of Persia source code was also successfully recovered from 3.5-inch floppy disks and published on GitHub: "jmechner/Prince-of-Persia-Apple-II".

Thanks to Jason Scott and Tony Diaz for successfully extracting the source code from a 22-year-old 3.5" floppy disk archive, a task that took most of a long day and night, and would have taken much longer if not for Tony's incredible expertise, perseverence, and well-maintained collection of vintage Apple hardware.

Analyzing Tabs and Spaces on GitHub

The author used BigQuery to analyze the difference between tabs and spaces on GitHub (this has the makings of a flame war XDDD): "400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?".

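The results show that, apart from C and Go, most programming languages have Space > Tab. The BigQuery queries the author used are also listed at the bottom of the article for reference.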
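
The original analysis was done with BigQuery queries over GitHub's data set. As a rough local illustration of the same kind of counting, here is a small Python sketch. The classification rule (a file counts as "tab" or "space" depending on which character starts more of its lines) and the function names are assumptions made for this example, not the article's exact query.

```python
from collections import Counter


def indent_style(source):
    """Classify one file as 'tab', 'space', or None by leading whitespace."""
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tab"] += 1
        elif line.startswith(" "):
            counts["space"] += 1
    if counts["tab"] == counts["space"]:
        return None  # no indented lines at all, or an exact tie
    return "tab" if counts["tab"] > counts["space"] else "space"


def tally(files):
    """Count file styles per extension.

    files: iterable of (extension, source_text) pairs.
    Returns {extension: Counter({'tab': n, 'space': m})}.
    """
    result = {}
    for ext, source in files:
        style = indent_style(source)
        if style is not None:
            result.setdefault(ext, Counter())[style] += 1
    return result
```

Feeding it `(extension, text)` pairs read from a local checkout gives a per-language tab/space tally in the spirit of the article's chart, on a far smaller sample.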

Another Post About Document Scanning...

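The post "Page dewarping" covers document scanning techniques along with an open source program. Given its timing relative to the previously mentioned posts "Dropbox's document scanning feature" and "Dropbox's Document Detecting", there is a faint whiff of mischief XD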

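The author wrote this for his fiancée's needs. He originally ran a script by hand whenever students sent in their homework; later his fiancée started using it too, and when she came back with photos of curled, warped pages, he decided to handle the dewarping automatically: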

A while back, I wrote a script to create PDFs from photos of hand-written text. It was nothing special – just adaptive thresholding and combining multiple images into a PDF – but it came in handy whenever a student emailed me their homework as a pile of JPEGs. After I demoed the program to my fiancée, she ended up asking me to run it from time to time on photos of archival documents for her linguistics research. This summer, she came back from the library with a number of images where the text was significantly warped due to curled pages.

So I decided to write a program that automatically turns pictures like the one on the left below to the one on the right:

All of the code can be found on GitHub, in the repository titled Text page dewarping using a "cubic sheet" model. It feels like it is going head-to-head with Dropbox XDDD

GitHub Pages Can Now Publish from Other Branches

Previously GitHub Pages could only publish from the gh-pages branch, and GitHub has now improved this: "Simpler GitHub Pages publishing".

You can select the master branch directly, which is much simpler in most cases.

You can also publish from the /docs directory, keeping the site separate from the data in other directories.

Changing the Target Branch After Opening a Pull Request

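A new GitHub feature lets you change the target branch after a pull request has been opened, so you can continue the discussion or make adjustments and then merge back into the dev or master branch: "Change the base branch of a Pull Request".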

The ALTER TABLE Approach GitHub Developed

GitHub explains how they run ALTER TABLE on MySQL: "gh-ost: GitHub's online schema migration tool for MySQL".

GitHub's old approach was to use pt-online-schema-change, which ran into several problems. Of these, only non-pausability looks like a real pain point:

Non pausability: when load on the master turns high, you wish to throttle or suspend your pending migration. However a trigger-based solution cannot truly do so. While it may suspend the row-copy operation, it cannot suspend the triggers. Removal of the triggers results in data loss. Thus, the triggers must keep working throughout the migration. On busy servers, we have seen that even as the online operation throttles, the master is brought down by the load of the triggers.

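Once the migration has started, the extra triggers cannot be stopped (stopping them means starting over from scratch), and they affect the production service.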

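The new approach is built on replication instead: changes are picked up from the replication stream (the binary log), typically by connecting to a replica, rather than from triggers. The tables are switched once the copy is finished, and anything that comes up along the way (such as the need to pause) is easy to handle.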

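This approach is more heavyweight, but that is not a problem for organizations whose systems are already at scale... It looks like a direction worth studying in the future XD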
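
To illustrate why a log-based migration can be paused while a trigger-based one cannot, here is a toy, single-process model. It is only a sketch under heavy assumptions: plain dicts stand in for tables, a deque stands in for the binary log, and the class and function names are hypothetical. None of it is gh-ost's real code or API.

```python
from collections import deque


def write(table, binlog, op, key, row=None):
    """Hypothetical application write: change the table, then log the event."""
    if op == "delete":
        table.pop(key, None)
    else:
        table[key] = row
    binlog.append((op, key))


class GhostMigration:
    """Toy model of a trigger-less online schema migration."""

    def __init__(self, original, binlog, transform, chunk_size=2):
        self.original = original          # live table: {primary_key: row}
        self.binlog = binlog              # deque of (op, primary_key) events
        self.transform = transform        # maps an old-schema row to the new schema
        self.chunk_size = chunk_size
        self.ghost = {}                   # new-schema table being built
        self.pending = sorted(original)   # keys that existed when we started
        self.paused = False

    def _copy(self, key):
        if key in self.original:          # the row may have been deleted meanwhile
            self.ghost[key] = self.transform(self.original[key])

    def step(self):
        """Do one unit of work; returns False when paused or when nothing is left."""
        if self.paused:
            return False  # nothing runs while paused; the log simply accumulates
        worked = False
        # 1. Replay the changes logged since the previous step, in order.
        while self.binlog:
            op, key = self.binlog.popleft()
            if op == "delete":
                self.ghost.pop(key, None)
            else:
                self._copy(key)
            worked = True
        # 2. Copy the next chunk of pre-existing rows.
        chunk = self.pending[:self.chunk_size]
        self.pending = self.pending[self.chunk_size:]
        for key in chunk:
            self._copy(key)
            worked = True
        return worked
```

A usage sketch: call `step()` in a loop, set `paused = True` whenever load is high, and let the application keep calling `write()` in the meantime. Once `step()` returns `False` while not paused, the ghost table matches the live table in the new schema and the two can be swapped. A real cut-over has to be atomic, which this toy model does not attempt.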