Fake GitHub Star 的生意

昨天在 Hacker News 首頁上看到「Tracking the Fake GitHub Star Black Market (dagster.io)」這篇,原文在「Tracking the Fake GitHub Star Black Market with Dagster, dbt and BigQuery」這邊。

作者群想要偵測 GitHub 上面 fake star 的行為,所以就跑去找黑市買,然後找到了兩家,Baddhi Shop (1000 個 $64) 與 GitHub24 (每個 €0.85,大約是 $0.91),價錢差異很大,「品質」差異也很大:貴的 star 在一個月後還是存在,而便宜的看起來有一些有被 GitHub 偵測到而清除掉:

A month later, all 100 GitHub24 stars still stood, but only three-quarters of the fake Baddhi Shop stars remained. We suspect the rest were purged by GitHub’s integrity teams.

接下來就是想要系統化分析,切入點是 GH Archive 這個服務,可以直接下載 GitHub 全站上的 public evnets 資料:

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

想要偵測兩種不同的 fake account,第一種是 obvious fake account,定義成這樣:

  • Created in 2022 or later
  • Followers <=1
  • Following <= 1
  • Public gists == 0
  • Public repos <=4
  • Email, hireable, bio, blog, and twitter username are empty
  • Star date == account creation date == account updated date

從定義就可以看出來完全就是灌水帳號,開出來就沒在動的。從 screenshot 可以看出這種帳號長的都一樣:

另外一種則是透過演算法去分析,這邊拿 unsupervised clustering 類的演算法分析出來的結果,可以看到抓到很多:

最近 NN 類的 machine learning 演算法太多,看到這些傳統的 machine learning 演算法還是覺得頗新鮮的...


Hacker News Daily 上看到 Palette 這個服務,作者在 Hacker News 上有提到你可以提供一些句子調整顏色:「Show HN: I made a new AI colorizer (palette.fm)」。

Hi HN, I’m Emil, the maker behind Palette. I’ve been tinkering with AI and colorization for about five years. This is my latest colorization model. It’s a text-based AI colorizer, so you can edit the colorizations with natural language. To make it easier to use, I also automatically create captions and generate filters.

作者有把一些作品貼在 Reddit 上面,可以參考 https://www.reddit.com/user/emilwallner/?sort=top 這邊,看起來已經有一陣子了...


前幾天在 Hacker News Daily 上看到「TV backlight compensation (2020) (lofibucket.com)」這個,原文是 2020 年的文章「TV backlight compensation」,作者拿到一台不用錢的電視機,但這台電視機的背光已經老化了,發光亮度不平均,大概是這樣的情況:

把電視機的背光問題當作是一個函數 f(x),然後把翻拍產生的亮度差當作是 g(x),目標是要找出 f(x) 的反函數 f-1(x):

然後從作者的程式碼與解釋可以看出來不是針對每個 pixel 都單獨調整,而是全部套用同一個公式,然後目標是針對黑白畫面去調整:

The whole point of this exercise was to make black-and-white movies look better.


作者是把這個修正掛到 MPC-BE 上面,但 Hacker News 上面有人提到也可以實做 Gnome 的版本,直接讓整個 OS 都可以套用:「Hello1024/gse-shader」。

有點拼但還蠻有趣的東西 XD

Python 的 Black

Hacker News 上看到 Black 這個幫你處理 Python 程式碼的工具:「Black, the uncompromising Python code formatter, is stable (pypi.org)」。

Black is the uncompromising Python code formatter. By using it, you agree to cede control over minutiae of hand-formatting. In return, Black gives you speed, determinism, and freedom from pycodestyle nagging about formatting. You will save time and mental energy for more important matters.

然後從 Hacker News 上討論的情況看起來大家都覺得很不錯?好像可以看看能不能拿來用...

另外一個在討論的時候看到學到的東西,是 git blame --ignore-revs-file 這個功能,可以在 git blame 時濾掉某些 commit,剛好拿來過濾 reformatting commit:

Ignore revisions listed in file, which must be in the same format as an fsck.skipList. This option may be repeated, and these files will be processed after any files specified with the blame.ignoreRevsFile config option. An empty file name, "", will clear the list of revs from previously processed files.

StackOverflow 開賣 Ctrl、C、V 的鍵盤

StackOverflow 今年愚人節的鍵盤真的開賣了:「No joke—you can buy our copy/paste keyboard right now」。

愚人節的文章在「Introducing The Key」,這次開賣的網站是跟 Drop 合作:「Stack Overflow The Key Macropad | Mechanical Keyboards | Mini Mechanical Keyboards | Drop」,可以看到是機械鍵盤,但要 US$29 一隻...

鍵盤是凱華 BOX 黑軸:

They’re also outfitted with Kailh Box Black switches to deliver an ultra-smooth linear feel.

然後可程式化定義 XDDD

Fully programmable, these three keys can do much more than copy and paste. In fact, you can configure them to perform virtually any key command you want.

不過想要的人也得注意一下,目前看到的 ship date 是年底了:

Estimated ship date is Dec 13, 2021 PT.

然後目前已經賣出 2.6k 件了?XDDD

2.6k Sold

Python 的 code formatter:Black

Black 是一套 Python 上的 Code Formatter,可以幫你重排程式碼以符合 coding style 與 coding standard,比起只是告訴你哪邊有錯來的更進階...

記得以前好像不是掛在官方帳號下面的,翻了一下發現在 Hacker News 上的「https://news.ycombinator.com/item?id=17151813」這則可以看到,去年在 ambv 的 repository 上,現在則是被導到 python 的組織下了 :o

目前還是掛 beta,另外有不少 practice 讓人不太舒服,像是 Hacker News 上「https://news.ycombinator.com/item?id=19939806」這邊提到的:

Against my better judgment I'll bite.
I super dislike black's formatting, and I think it's really rare to actually see it in codebases. It wraps weirdly (sometimes not at all). I'd prefer to use yapf, but last I checked it still crashes on "f-strings".

Here's a small example:

        for satchel in satchels
        for apple in satchel
Black formats this as:
            for satchel in satchels
            for apple in satchel
I've never seen Python code like that.
I totally believe using a formatter is good practice. Black is in a challenging position of coming into a community with a lot of existing code and customs, and I get that. But I also think that's an opportunity, rather than having to guess at what is good, there's a wealth of prior art to look at. I wish it had done this, rather than essentially codify the author's style.

看起來還有很多可以調整的,然後也可以考慮用看看... 以前是 3rd-party 還可以丟著不管,現在帶有官方色彩得看一下 :o

前陣子在 Black Hat 上發表的 HEIST 攻擊 (對 HTTPS 的攻擊)

又是一個對 HTTPS 的攻擊:「HEIST attack on SSL/TLS can grab personal info, Black Hat」、「New attack steals SSNs, e-mail addresses, and more from HTTPS pages」。

一樣是 Compression 產生的 side-channel attack,只是這次是結合 TCP window size 的攻擊。投影片可以在「HTTP Encrypted Information can be Stolen through TCP-windows (PDF)」這邊看到。

這次的攻擊只需要在瀏覽器上插入 HTTP 產生 HTTPS 的流量,然後從旁邊看 HTTPS 連線的 TCP packet 就可以了,而且對 HTTP/2 也很有效:

而且很不幸的,目前沒有太好的解法,因為所有的攻擊手法都是照著使用者無法發現的路徑進行的 @_@

對於使用者,大量使用 HTTPS (像是 HTTPS Everywhere 這樣的套件),能夠降低政府單位與 ISP 將 javascript 插入 HTTP 連線,產生 HTTPS 的行為。

而對於網站端來說,全站都隨機產生不同長度的 HTTP header 可能是個增加破解難度的方式 (而且不會太難做,可以透過 apache module 或是 nginx module 做到),但還是可以被統計方法再推算出來。

不知道有沒有辦法只對 HTTPS 開 javascript,雖然攻擊者還是可以用 <img> 攻擊...

也許以後 HTTP/3 之類的協定會有一區是不壓縮只加密的,避開這類 compression-based attack @_@


在「Black-market archives」這篇給出了一份很寶貴的資料,是來自於 Tor hidden service 上的 Dark Net Markets (DNM)。

這份資料涵蓋了 2013 到 2015 年的各種紀錄:

From 2013-2015, I scraped/mirrored on a weekly or daily basis all existing English-language DNMs as part of my research into their usage, lifetimes/characteristics, & legal riskiness; these scrapes covered vendor pages, feedback, images, etc.

大約壓縮後 50GB 的資料:

This uniquely comprehensive collection is now publicly released as a 50GB (~1.6TB) collection covering 89 DNMs & 37+ related forums, representing <4,438 mirrors, and is available for any research.

Tor 的 hidden service 應該只會愈來愈流行,初期的這些資料會讓後人有很多題材可以分析...