A Quick Way to Generate SQLite Data: One Billion Rows in Under a Minute

Over at "Towards Inserting One Billion Rows in SQLite Under A Minute" the author tries to write 1B rows into SQLite within a minute on a 2019 MBP, and some of the techniques in there are worth playing with. The spec of this MBP 2019 machine:

The machine I am using is MacBook Pro, 2019 (2.4 GHz Quad Core i5, 8GB, 256GB SSD, Big Sur 11.1)

The first version was written in Python; inserting 10M rows took about 15 minutes:

In this script, I tried to insert 10M rows, one by one, in a for loop. This version took close to 15 minutes, sparked my curiosity and made me explore further to reduce the time.
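A minimal sketch of what that naive loop looks like (the table schema and random values here are assumptions for illustration, not the author's actual script):

    import random
    import sqlite3

    conn = sqlite3.connect("naive.db")
    conn.execute("CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER, active INTEGER)")

    for _ in range(10_000_000):
        # One execute() call per row; the per-statement overhead is what
        # makes this version painfully slow.
        conn.execute(
            "INSERT INTO user VALUES (?, ?, ?)",
            (str(random.randint(100, 999)), random.randint(18, 80), random.randint(0, 1)),
        )

    conn.commit()
    conn.close()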

A version with five PRAGMAs added got to 100M rows in about ten minutes:

The naive for loop version took about 10 minutes to insert 100M rows.
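These are the usual write-speed PRAGMAs; the exact values below are assumptions rather than a copy of the post, but they show the general idea: trade away durability guarantees for raw insert speed.

    import sqlite3

    conn = sqlite3.connect("tuned.db")
    conn.executescript("""
        PRAGMA journal_mode = OFF;       -- no rollback journal
        PRAGMA synchronous = 0;          -- no fsync after each write
        PRAGMA cache_size = 1000000;     -- much larger page cache
        PRAGMA locking_mode = EXCLUSIVE; -- hold the file lock for the whole run
        PRAGMA temp_store = MEMORY;      -- keep temp tables/indexes in RAM
    """)

If the process crashes halfway, the database can end up corrupted, which is acceptable here because the data can simply be regenerated.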

Batching the inserts brings it down to about eight and a half minutes:

The batched version took about 8.5 minutes to insert 100M rows.
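The batched idea is to build a chunk of rows in memory and hand the whole chunk to SQLite in one call; a minimal sketch (batch size and schema are assumptions):

    import random
    import sqlite3

    BATCH_SIZE = 100_000

    conn = sqlite3.connect("batched.db")
    conn.execute("CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER, active INTEGER)")

    def make_batch(n):
        return [
            (str(random.randint(100, 999)), random.randint(18, 80), random.randint(0, 1))
            for _ in range(n)
        ]

    for _ in range(100_000_000 // BATCH_SIZE):
        # executemany() pushes the whole batch in one call instead of one call per row.
        conn.executemany("INSERT INTO user VALUES (?, ?, ?)", make_batch(BATCH_SIZE))

    conn.commit()
    conn.close()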

Next, pulling out the classic weapon PyPy drops it to two and a half minutes:

All I had to do was run my existing code, without any change, using PyPy. It worked and the speed bump was phenomenal. The batched version took only 2.5 minutes to insert 100M rows. I got close to 3.5x speed :)

After that the author jumps to Rust; there is quite a bit of tuning discussion along the way, but skipping straight to the end... the final version takes just over 32 seconds for 100M rows:

I created a threaded version, where I had one writer thread that received data from a channel and four other threads which pushed data to the channel. This is the current best version which took about 32.37 seconds.
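The author's final version is in Rust, but the structure described in the quote (one writer owning the database, several producers feeding it through a channel) is easy to sketch in Python with a queue; the names and sizes here are assumptions, and this is only meant to show the shape of the design, not to reproduce the Rust performance:

    import queue
    import random
    import sqlite3
    import threading

    NUM_PRODUCERS = 4
    BATCHES_PER_PRODUCER = 50
    BATCH_SIZE = 10_000
    STOP = object()  # sentinel telling the writer a producer has finished

    q = queue.Queue(maxsize=16)

    def producer():
        for _ in range(BATCHES_PER_PRODUCER):
            batch = [
                (str(random.randint(100, 999)), random.randint(18, 80), random.randint(0, 1))
                for _ in range(BATCH_SIZE)
            ]
            q.put(batch)
        q.put(STOP)

    def writer():
        # Only this thread touches SQLite, so there is no contention on the database.
        conn = sqlite3.connect("threaded.db")
        conn.execute("CREATE TABLE IF NOT EXISTS user (area TEXT, age INTEGER, active INTEGER)")
        finished = 0
        while finished < NUM_PRODUCERS:
            item = q.get()
            if item is STOP:
                finished += 1
                continue
            conn.executemany("INSERT INTO user VALUES (?, ?, ?)", item)
        conn.commit()
        conn.close()

    producers = [threading.Thread(target=producer) for _ in range(NUM_PRODUCERS)]
    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    writer_thread.join()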

Where PyPy is an option, it's still worth considering...

The First Video in YouTube History to Pass Four Billion Views: Despacito

It hit the mark a few days ago, on 2017/10/11: "Luis Fonsi & Daddy Yankee's 'Despacito' Is First Clip to Hit 4 Billion Views on YouTube".

This is from the "Gangnam Style (music_video)" page.

In under a year XD

At this pace, it should pass UINT_MAX (4294967295) next month? Back then the YouTube folks posted a PR piece on Google+ about passing INT_MAX (though it was later deleted): "Gangnam Style overflows INT_MAX, forces YouTube to go 64-bit".
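A quick back-of-the-envelope check of that "next month" guess, using only the figures above (4 billion views in roughly a year, UINT_MAX = 4294967295); the daily rate is a rough assumption derived from them:

    UINT_MAX = 4294967295
    current_views = 4_000_000_000
    views_per_day = current_views / 365           # rough assumption: ~11M views/day

    days_to_go = (UINT_MAX - current_views) / views_per_day
    print(f"~{days_to_go:.0f} days to UINT_MAX")  # roughly 27 days, i.e. about a month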

So will they post one this time?

Yahoo!'s Data Breach Count Exceeds the Previously Announced One Billion, Rising to Three Billion

Oath (Yahoo!'s new owner, under Verizon) has issued a new notice, and the number of leaked accounts jumps straight to 3 billion: "Yahoo provides notice to additional users affected by previously disclosed 2013 data theft".

In other words, every user at the time was affected:

Subsequent to Yahoo's acquisition by Verizon, and during integration, the company recently obtained new intelligence and now believes, following an investigation with the assistance of outside forensic experts, that all Yahoo user accounts were affected by the August 2013 theft.

The report at "Yahoo says all 3 billion user accounts were impacted by 2013 security breach" spells it out more clearly and digs up the user numbers from back then:

Yahoo today announced that the huge data breach in August 2013 affected every user on its service — that’s all three billion user accounts and up from the initial one billion figure Yahoo initially reported.

The 2013 batch used MD5 hashes, which, given today's computing power, might as well be no hash at all...:

The stolen user account information may have included names, email addresses, telephone numbers, dates of birth, hashed passwords (using MD5) and, in some cases, encrypted or unencrypted security questions and answers.
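A tiny illustration of why unsalted MD5 is close to "no hash at all": the same password always produces the same digest, so a precomputed table of common passwords reverses leaked hashes instantly (the word list here is made up):

    import hashlib

    common_passwords = ["123456", "password", "qwerty", "letmein"]
    lookup = {hashlib.md5(p.encode()).hexdigest(): p for p in common_passwords}

    leaked_hash = hashlib.md5(b"letmein").hexdigest()  # pretend this came from the dump
    print(lookup.get(leaked_hash, "not in table"))     # -> "letmein"

In practice attackers also have GPUs that compute billions of MD5 hashes per second, so even passwords outside any precomputed table don't last long.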

It's already "all", so the next bigger breach will have to be about something else entirely...

The Speed Difference Between Amazon S3 and HDFS

The author keeps using the A Billion Taxi Rides dataset to test various differences; this time it's the speed difference between Amazon S3 and HDFS: "A Billion Taxi Rides: AWS S3 versus HDFS".

The first half is all about setting up the test environment; the key part is at the end of the article (the "Benchmarking HDFS" section), which lists the speeds of various queries. HDFS is roughly 1.25x to 1.75x as fast as Amazon S3, and the author's conclusion is:

Though the speed improvements using HDFS are considerable, S3 did perform pretty well. At worst there's a 1.75x overhead in exchange for virtually unlimited scalability, 11 9's of durability and no worrying about over/under-provisioning storage space.

Although HDFS is faster, Amazon S3 actually performs quite well, and its durability (99.999999999%, i.e. 11 nines) plus never having to worry about over- or under-provisioning storage are factors that should also be weighed in.
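A hypothetical PySpark sketch of the kind of comparison being made (the article's actual engine, file format, and paths may differ; the bucket name and column are placeholders based on the taxi dataset): the query stays the same and only the storage path changes.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("taxi-s3-vs-hdfs").getOrCreate()

    for path in ["hdfs:///taxi/trips.parquet", "s3a://example-bucket/taxi/trips.parquet"]:
        trips = spark.read.parquet(path)
        trips.createOrReplaceTempView("trips")
        # Same aggregation against both backends; only the load path differs.
        spark.sql("SELECT cab_type, COUNT(*) AS cnt FROM trips GROUP BY cab_type").show()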

The A Billion Taxi Rides Data Analysis Series

Mark Litwintschik has recently been publishing a series of data-analysis posts on the A Billion Taxi Rides dataset.

It's the same dataset every time (and the dataset is big enough that benchmarks on it actually mean something), analyzed with different tools, so it's worth a look for anyone choosing tools; and since the posts include plenty of command samples, it's also great material if you want to run the tests yourself...