YouTube 史上第一個破四十億次播放的影片:Despacito

前幾天 2017/10/11 達到了:「Luis Fonsi & Daddy Yankee's 'Despacito' Is First Clip to Hit 4 Billion Views on YouTube」。

出自「Gangnam Style (music_video)」這頁。

不到一年的時間 XD

依照這個速度,應該是下個月會破 UINT_MAX (4294967295)?當年 YouTube 的人有在 Google+ 上發了一則關於超過 INT_MAX 的 PR 稿 (不過後來刪除了):「Gangnam Style overflows INT_MAX, forces YouTube to go 64-bit」。


Yahoo! 的資料外洩數量超過之前公佈的十億筆,上升到三十億筆

Oath (Y! 的新東家,Verizon 持股) 發表了新的通報,外洩數量直接上升到 3 billion 了:「Yahoo provides notice to additional users affected by previously disclosed 2013 data theft」。


Subsequent to Yahoo's acquisition by Verizon, and during integration, the company recently obtained new intelligence and now believes, following an investigation with the assistance of outside forensic experts, that all Yahoo user accounts were affected by the August 2013 theft.

在「Yahoo says all 3 billion user accounts were impacted by 2013 security breach」這邊的報導則是寫的比較清楚,把當時的使用者數字翻出來:

Yahoo today announced that the huge data breach in August 2013 affected every user on its service — that’s all three billion user accounts and up from the initial one billion figure Yahoo initially reported.

2013 這包用的是 MD5 hash,以現在的運算能力來看,可以當作沒有 hash...:

The stolen user account information may have included names, email addresses, telephone numbers, dates of birth, hashed passwords (using MD5) and, in some cases, encrypted or unencrypted security questions and answers.

已經是 "all" 了,接下來要更大包只能是其他主題了...

Amazon S3 與 HDFS 的速度差異

作者繼續以 A Billion Taxi Rides 的資料測試各種差異,這次測了 Amazon S3HDFS 的速度差異:「A Billion Taxi Rides: AWS S3 versus HDFS」。

前半部都在說明測試的環境設定,重點在文章的最後面 (也就是「Benchmarking HDFS」這段),裡面有各種 query 的速度。HDFS 的速度大約是 Amazon S3 的 1.25 到 1.75 倍,作者給的結論是:

Though the speed improvements using HDFS are considerable, S3 did perform pretty well. At worst there's a 1.75x overhead in exchange for virtually unlimited scalability, 11 9's of durability and no worrying about over/under-provisioning storage space.

雖然 HDFS 比較快,但 Amazon S3 其實表現的不錯,另外資料安全性 (平均 99.999999999%,也就是 11 個 9 的 durability) 及不需要怕空間不夠的優點也是應該考慮進去的因素。

A Billion Taxi Rides 資料分析系列

Mark Litwintschik 最近在連載 A Billion Taxi Rides 的資料分析系列作品:

同樣的資料 (而且這個資料量夠大,拿來 benchmark 比較有參考價值),用不同的工具分析,對於要挑工具的人可以看一看,另外也因為裡面給了很多 command sample,要自己動手測試也是個很棒的資料...