GitHub 更換 github.com 的 SSH host key (RSA 部份)

看到 GitHub 宣佈更換 SSH key (RSA 的部份):「We updated our RSA SSH host key」。

Hacker News 上的「We updated our RSA SSH host key (github.blog)」這邊有人提到官方文件 (原文裡面也有提到),裡面會列出對應的 key:「GitHub's SSH key fingerprints」。

只是看起來這些大公司依然對 DNSSEC + SSHFP 這樣的組合沒興趣...

Fake GitHub Star 的生意

昨天在 Hacker News 首頁上看到「Tracking the Fake GitHub Star Black Market (dagster.io)」這篇,原文在「Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery」這邊。

作者群想要偵測 GitHub 上面 fake star 的行為,所以就跑去找黑市買,然後找到了兩家,Baddhi Shop (1000 個 $64) 與 GitHub24 (每個 €0.85,大約是 $0.91),價錢差異很大,「品質」差異也很大:貴的 star 在一個月後還是存在,而便宜的看起來有一些有被 GitHub 偵測到而清除掉:

A month later, all 100 GitHub24 stars still stood, but only three-quarters of the fake Baddhi Shop stars remained. We suspect the rest were purged by GitHub’s integrity teams.

接下來就是想要系統化分析,切入點是 GH Archive 這個服務,可以直接下載 GitHub 全站上的 public evnets 資料:

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

想要偵測兩種不同的 fake account,第一種是 obvious fake account,定義成這樣:

  • Created in 2022 or later
  • Followers <=1
  • Following <= 1
  • Public gists == 0
  • Public repos <=4
  • Email, hireable, bio, blog, and twitter username are empty
  • Star date == account creation date == account updated date

從定義就可以看出來完全就是灌水帳號,開出來就沒在動的。從 screenshot 可以看出這種帳號長的都一樣:

另外一種則是透過演算法去分析,這邊拿 unsupervised clustering 類的演算法分析出來的結果,可以看到抓到很多:

最近 NN 類的 machine learning 演算法太多,看到這些傳統的 machine learning 演算法還是覺得頗新鮮的...

GitHub 自己開發的搜尋引擎

前陣子 GitHub 發了一篇文章,說明自己開發搜尋引擎的心路歷程:「The technology behind GitHub’s new code search」。

看了一下其實就是自己幹了一套 search engine cluster,然後針對 code search 把一些功能放進去。

目前這套 search enginer 還是 beta 版本,全站兩億個 repository 只包括了 4500 萬 (大概 22% 左右),然後已經有 115TB 的程式碼了;另外也題到了先前導入 Elasticsearch 時的數字是 800 萬個 repository:

GitHub’s scale is truly a unique challenge. When we first deployed Elasticsearch, it took months to index all of the code on GitHub (about 8 million repositories at the time). Today, that number is north of 200 million, and that code isn’t static: it’s constantly changing and that’s quite challenging for search engines to handle. For the beta, you can currently search almost 45 million repositories, representing 115 TB of code and 15.5 billion documents.

目前是 32 台機器,沒有特別提到記憶體大小,也沒有提到 replication 之類的數字:

Code search runs on 64 core, 32 machine clusters.

然後各種 inverted index 與各種資料在壓縮後只有 25TB:

There are some big wins on the size of the index as well. Remember that we started with 115 TB of content that we want to search. Content deduplication and delta indexing brings that down to around 28 TB of unique content. And the index itself clocks in at just 25 TB, which includes not only all the indices (including the ngrams), but also a compressed copy of all unique content. This means our total index size including the content is roughly a quarter the size of the original data!

換算一下,就會發現現在已經是「暴力」可以解很多事情的年代了,而這已經是全世界最大的 code hosting。

以前隨便一個主題搞大一點就會撞到 Amdahl's law,現在輕鬆不少...

GitHub 宣佈將在 2024 移除對 Subversion 支援

GitHub 從 2010 年的愚人節時宣佈支援 Subversion:「Announcing SVN Support」,雖然是愚人節的功能,但這個功能是會動的。

而這個功能出現十多年後,可以預期用的人愈來愈少。前幾天 GitHub 宣佈將在大約一年後的 2024/01/08 停止支援 Subversion:「Sunsetting Subversion support」。

然後在 Hacker News 上的討論「GitHub Sunsetting Subversion Support (github.blog)」則是直接讓 Scott Chacon (GitHub co-founder,同時也是第一篇公告文的作者 + 這個功能的兇手之一) 出來解釋當初搞出這個東西的前因與後果,還有一些感想。

裡面有提到這個功能當初推出來的時候是個好玩的性質,但意外的在上線後發現也讓一些老系統可以比較容易轉移:也就是讓 developer 可以先開始用 Git,但 CI 類的工具可以先不用改。

As one of the GitHub cofounders and the brainchild of this particular feature, I want to let everyone know that this is maybe the funniest thing I've ever done.

We released this feature and published the announcing blog post, on April Fool's Day, 2010. I remember demoing it to the other GitHub guys and saying how funny it would be if we made this an April Fool's day post as though it was a big stupid joke but then it actually completely worked on every repository we had and we all thought it would be great. Until nobody believed us. Which in hindsight we should have seen coming, since that was the joke, but nobody actually tried it. Then people tried it and it worked and they thought it was a trick or something.

It was really helpful for people migrating from legacy SVN based systems to us (CI and stuff) but I'm surprised to some degree that it's still running 13 years later when nobody is really facing that issue anymore. And I'm still undecided if the joke was worth the massive confusion it caused. But if I'm pressed, I would say that I would 100% release it on April Fool's Day again.

不過想要拔掉這個功能的起因不知道是什麼,技術債不知道有多少...

GitHub 的 2FA 新計畫

GitHub 決定擴大強制使用 2FA 的範圍:「Raising the bar for software security: next steps for GitHub.com 2FA」。

本來的 2FA policy 是在「Software security starts with the developer: Securing developer accounts with 2FA」這邊提到的,所有的使用者預定在 2023 年年底強制使用 2FA:

GitHub will require all users who contribute code on GitHub.com to enable one or more forms of two-factor authentication (2FA) by the end of 2023.

另外針對 npm 熱門套件的維護者,則是在三月 (top 100) 與五月 (top 500) 就強制要使用了:

In February we enrolled all maintainers of the top-100 packages on the npm registry in mandatory 2FA, and in March we enrolled all npm accounts in enhanced login verification. On May 31, we will be enrolling all maintainers of the top-500 packages in mandatory 2FA.

但到 2023 年年底才有動作會有點久,所以這次宣佈在 2023 年三月會插入一個新階段,強制這些人要有 2FA 才能進行變更類的操作:

  • Users who published GitHub or OAuth apps or packages
  • Users who created a release
  • Users who are Enterprise and Organization administrators
  • Users who contributed code to repositories deemed critical by npm, OpenSSF, PyPI, or RubyGems
  • Users who contributed code to the approximate top four million public and private repositories

以 developer 為主的 GitHub 大力在推 2FA,其他的服務不知道之後會不會有類似動作...

WebKit 專案將從 Subversion 搬到 GitHub 上

Hacker News 首頁上看到 WebKit 專案宣佈從本來的 Subversion 搬到 GitHub 上:「WebKit on GitHub!」,新的專案位置在 WebKit/WebKit 這邊。對應的討論在「WebKit Migrates from Subversion to GitHub (webkit.org)」這邊。

但 issue tracking system 的部份目前看起來還是繼續用 WebKit Bugzilla,沒有開 GitHub 的 issue 功能,但至少是從 Subversion 換成 Git 了。

在 Hacker News 上的討論意外看到已經是歷史的 SVK (試著在 Subversion 上面堆一些功能),而且還聊到了一些 tricky 的技巧:

Funny story: my first task when I joined the original iPhone team was to merge our forked WebKit with master. It was a sort of hazing ritual slash "when else would we do it but when someone new joins?". Anyways, we used a tool called SVK[1] in order to get very primitive "git-like" abilities. It was basically a bunch of scripts that used SVN under the hood. For example, in order to get the "local everything"-style behaviors of git, the very first thing it did was checkout every single version of the repository in question. For WebKit, this meant that the first day was spent leaving the computer alone and letting it download for hours. I made the mistake of having a space somewhere in the path of the target folder, which broke something or other, so I ended up having to do it all over again.

Anyways, I distinctly remember one of the instructions for merging WebKit in our internal wiki being something like "now type `svk merge`, but hit ctrl-c immediately after! You don't want to use the built-in merge, it'll break everything, but this is the only way to get a magic number that you can find stored in [some file] after the merge has started. If it's not there, try `svk merge` again and let it go a little longer than last time." A few hires later (I think possibly a year after) someone set up a git mirror internally to avoid having to do this craziness, which if I remember correctly, was treated with some skepticism. This was 2007, so why would we try some new-fangled git thing when we had svk?

1. https://wiki.c2.com/?SvkVersionControl

我記得那個時間點 VCS 的選擇的確是個有趣的決策過程... 除了 Git 以外還有 Mercurial,另外還有幾個當時就已經算小眾的 open source solution。

而到了 2010 後就比較明朗了,現在幾乎是 Git 一統天下了,Mercurial 目前最大的使用者應該是 Meta (Facebook)?

跑在本機的 GitHub Copilot 替代品

Hacker News 上看到「FauxPilot – an attempt to build a locally hosted version of GitHub Copilot (github.com/moyix)」這個本機上跑 GitHub Copilot 協定的專案。專案的 GitHub 在「FauxPilot - an open-source GitHub Copilot server」這邊。

裡面用的是 Salesforce 放出來的 CodeGen,不過 Salesforce 提供了 350M、2B、6B 與 16B 的 model,但在 FauxPilot 這邊目前只看到 350M、6B 與 16B 的 model 可以用,少了 2B 這組,然後需要的 VRAM 就有點尷尬了:

[1] codegen-350M-mono (2GB total VRAM required; Python-only)
[2] codegen-350M-multi (2GB total VRAM required; multi-language)
[3] codegen-6B-mono (13GB total VRAM required; Python-only)
[4] codegen-6B-multi (13GB total VRAM required; multi-language)
[5] codegen-16B-mono (32GB total VRAM required; Python-only)
[6] codegen-16B-multi (32GB total VRAM required; multi-language)

13GB 剛好超過 3080 Ti 的 12GB,所以不是 3090 或 3090 Ti 的使用者就只能跑 350M 這個版本?看 Hacker News 上的討論似乎是有打算要弄 2B 的版本啦...

然後我自己雖然是 11GB 的 1080 Ti,想跑個 350M 的版本測試看看,但看起來相關的 Nvidia driver 沒裝好造成他識別不到,加上我是用 neovim,看了一下目前 ~/.config/github-copilot/hosts.json 的內容,程式碼應該是寫死到 GitHub API 上使用:

{"github.com":{"user":"gslin","oauth_token":"x"}}

先暫時放著好了,晚點等 2B 版本出現後再回來看看有沒有比較完整的指示...

測試 Neovim + GitHub Copilot

如同之前在「GitHub Copilot 宣佈 GA」提到的,Copilot 有支援 Neovim,找了一下在 GitHub 上的 github/copilot.vim 這邊可以取得。

Copilot.vim is a Vim plugin for GitHub Copilot. For now, it requires Neovim 0.6 (for virtual lines support) and a Node.js installation.

主要有兩個 dependency 問題,第一個是 Neovim 版本要 0.6+,而在 Ubuntu 20.04 內的版本不夠新 (22.04 的看起來就夠),可以裝 PPA 版本解決:「Neovim Stable」。

另外一個是 Node.js 版本需要到 16+ (20.04 與 22.04 內建的都不夠),這個我是靠 nvm 解決。

先在 GitHub 網站上開通 Copilot,再照著說明,回到 Neovim 裡執行 :Copilot setup,跟著步驟跑授權流程就可以了。

接下來隨便開個 test.py 或是 test.php 檔開始寫,就會發現有 suggestion 跑出來了。

這邊拿 feedgen 測試會不會動,輸入 feed. 後就會出現灰色的 subtitle(title)

這時候按 tab 就會展出來了。

AWS 也推出了 GitHub Copilot 的競爭對手 Amazon CodeWhisperer

AWS 推出了 Amazon CodeWhisperer,可以看做是 GitHub Copilot 的競爭產品:「Now in Preview – Amazon CodeWhisperer- ML-Powered Coding Companion」,在 Hacker News 上的討論還不多:「Copilot just got company: Amazon announced Codewhisperer (amazon.com)」。

目前還是 Preview 所以是免費的,但也還沒有提供價錢:

During the preview period, developers can use CodeWhisperer for free.

另外目前提供的程式語言只有 PythonJavaJavaScript

The preview supports code written in Python, Java, and JavaScript, using VS Code, IntelliJ IDEA, PyCharm, WebStorm, and AWS Cloud9. Support for the AWS Lambda Console is in the works and should be ready very soon.

至於 training 的資料集,這邊有提到的是 open source 專案與 Amazon 自家的東西:

CodeWhisperer code generation is powered by ML models trained on various data sources, including Amazon and open-source code.

開發應該需要一段時間,不知道是剛好,還是被 GitHub Copilot 轉 GA 的事件強迫推出 Preview 版...

GitHub Copilot 宣佈 GA

GitHub Copilot 宣佈 GA:「GitHub Copilot is generally available to all developers」,Hacker News 上的討論可以看一下:「GitHub Copilot is generally available (github.blog)」。

價錢也出來了,US$10/mo 或是 US$100/year:

We’re making GitHub Copilot, an AI pair programmer that suggests code in your editor, generally available to all developers for $10 USD/month or $100 USD/year. It will also be free to use for verified students and maintainers of popular open source projects.

不過重點不是價錢,而是還沒有被挑戰過的 license 問題,像是在 Hacker News 上有人提到有些程式碼的授權是有感染性的 GPL 類的,這些在法院上還沒有被戰過。

不過還是很看好這個服務,畢竟可以處理掉很多無聊的 coding 時間... 查了一下發現 Neovim 已經有支援了,似乎可以來看看要怎麼玩 :o