Reddit Releases Its Full Site-wide Submission Data

A few days ago Reddit announced the release of its full site-wide submission data: "Full Reddit Submission Corpus now available (2006 thru August 2015)". Due to some technical issues, only partial data for 2006 and 2007 made it into this release; a supplementary upload is planned to fill in the rest:

Data is complete from January 01, 2008 thru August 31, 2015. Partial data is available for years 2006 and 2007. The reason for this is that the id's used when Reddit was just a baby were scattered a bit -- but I am making an attempt to grab all data from 2006 and 2007 and will make a supplementary upload for that data once I'm satisfied that I've found all data that is available.

The dataset is about 42GB and covers almost all of the publicly available data:

This dataset represents approximately 200 million submission objects with score data, author, title, self_text, media tags and all other attributes available via the Reddit API.

The files are hosted on Amazon S3, but someone has posted a corresponding BitTorrent link; the key piece is the btih value: 9941b4485203c7838c3e688189dc069b7af59f2e.
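
With that info hash alone, a standard magnet URI is enough to fetch the torrent without the original .torrent file (assuming your client supports DHT, since no trackers are given here):

    magnet:?xt=urn:btih:9941b4485203c7838c3e688189dc069b7af59f2e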

It can be used for all kinds of research...
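
As a starting point, here is a minimal sketch of how one might scan the dump. It assumes the corpus follows the layout of other public Reddit dumps, i.e. bzip2-compressed newline-delimited JSON with one submission object per line; the filename is hypothetical:

    import bz2
    import json
    from collections import Counter

    # Count submissions per subreddit while streaming the (large) dump,
    # so the whole file never has to fit in memory.
    counts = Counter()
    with bz2.open("RS_full_corpus.bz2", "rt", encoding="utf-8") as f:
        for line in f:
            submission = json.loads(line)
            counts[submission.get("subreddit", "(unknown)")] += 1

    for name, n in counts.most_common(10):
        print(name, n)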

CloudFlare Speeds Up the zlib Library

CloudFlare has made major improvements to zlib, making it much faster than the original version: "Fighting Cancer: The Unexpected Benefit Of Open Sourcing Our Code".

The changes mainly exploit CPU instruction set features, together with improvements to the longest-match function (a conceptual sketch follows the benchmark results below). Across the different test corpora, speed improved substantially while the compression ratio did not get worse:

(Benchmark charts for the Calgary, Canterbury, Large, and Silesia corpora appear in the original post.)
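
To make the longest-match point concrete, here is a conceptual sketch in Python, not CloudFlare's actual code: deflate repeatedly asks how many bytes at the current position match a candidate earlier position from the hash chain, and optimized implementations win largely by comparing many bytes per step with wide loads instead of one byte at a time.

    # Conceptual sketch of deflate's longest-match step (illustration only).
    # 258 is deflate's maximum match length.
    def longest_match(data: bytes, cur: int, candidate: int,
                      max_len: int = 258) -> int:
        limit = min(max_len, len(data) - cur)
        length = 0
        while length < limit and data[candidate + length] == data[cur + length]:
            length += 1
        return length

    # Example: the second "abcde" matches the first for 5 bytes.
    print(longest_match(b"abcdeXYZabcdeQ", cur=8, candidate=0))  # -> 5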

The speedup is dramatic, which puts it at the opposite extreme from zopfli, which Google released with compression ratio as the goal: "Google releases a compression program compatible with zlib/deflate, another 5% smaller...".