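A few days ago the complete corpus of site-wide Reddit submissions was released: "Full Reddit Submission Corpus now available (2006 thru August 2015)". Because of some technical issues, the data for 2006 and 2007 is only partial in this release, and the author plans to fill in the rest later: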
Data is complete from January 01, 2008 thru August 31, 2015. Partial data is available for years 2006 and 2007. The reason for this is that the id's used when Reddit was just a baby were scattered a bit -- but I am making an attempt to grab all data from 2006 and 2007 and will make a supplementary upload for that data once I'm satisfied that I've found all data that is available.
It is about 42 GB of data and includes nearly everything that is publicly available:
This dataset represents approximately 200 million submission objects with score data, author, title, self_text, media tags and all other attributes available via the Reddit API.
The files are hosted on Amazon S3, but someone has also posted a corresponding BitTorrent link. The part that matters most is the btih value: 9941b4485203c7838c3e688189dc069b7af59f2e.
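Since only the btih value is really needed, a magnet link can be rebuilt from it. The following is a minimal sketch, not something from the original post: `make_magnet` is a hypothetical helper name, and the optional display name is purely illustrative. It relies only on the standard `magnet:?xt=urn:btih:<hash>` form and does no tracker or DHT lookup.

```python
from typing import Optional
from urllib.parse import quote

# The btih (BitTorrent info hash) quoted in the post: 40 hex characters.
BTIH = "9941b4485203c7838c3e688189dc069b7af59f2e"


def make_magnet(info_hash: str, display_name: Optional[str] = None) -> str:
    """Build a magnet URI from a hex-encoded info hash.

    make_magnet is a hypothetical helper written for this example.
    """
    info_hash = info_hash.strip().lower()
    is_hex = all(c in "0123456789abcdef" for c in info_hash)
    if len(info_hash) != 40 or not is_hex:
        raise ValueError("expected a 40-character hex info hash")

    uri = "magnet:?xt=urn:btih:" + info_hash
    if display_name:
        # dn= is only a human-readable label; clients do not need it.
        uri += "&dn=" + quote(display_name)
    return uri


if __name__ == "__main__":
    print(make_magnet(BTIH))
```

Pasting the resulting URI into a BitTorrent client that supports magnet links should be enough to start the download, provided peers are still seeding it.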
It can be used for all kinds of research...
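As a rough illustration of that kind of analysis, here is a small sketch that sums scores per author. It rests on assumptions the post does not confirm: that the corpus is stored as one JSON object per line, and that the attributes named in the announcement appear under the keys `author` and `score`. The function names and the sample records are made up for the example.

```python
import json
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def iter_submissions(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line, skipping blank or malformed lines.

    Assumes one JSON object per line (not confirmed by the source).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            yield obj


def top_authors(lines: Iterable[str], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n authors with the highest total score."""
    totals: Counter = Counter()
    for sub in iter_submissions(lines):
        author = sub.get("author")  # assumed key name
        score = sub.get("score")    # assumed key name
        if author is None or not isinstance(score, int):
            continue
        totals[author] += score
    return totals.most_common(n)


if __name__ == "__main__":
    # Made-up sample records, not taken from the real corpus.
    sample = [
        '{"author": "alice", "title": "First post", "score": 10}',
        '{"author": "bob", "title": "Hello", "score": 3}',
        '{"author": "alice", "title": "Second post", "score": 5}',
        "not json",
    ]
    print(top_authors(sample))
```

Because the lines are consumed one at a time, the same function could be fed an open file handle instead of a list, which matters for a dataset of roughly 42 GB.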
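Someone had posted an unofficial dataset earlier (October 2007 to May 2015). I am not sure whether its record format and contents are the same as this one.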
ref:
https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/
http://blaze.pydata.org/blog/2015/09/16/reddit-impala/