快速產生 SQLite 資料的方式:一分鐘內產生十億筆資料

在「Towards Inserting One Billion Rows in SQLite Under A Minute」這邊看到作者想要在一分鐘內在 MBP 2019 上面寫 1B 筆資料進 SQLite,裡面有些方法還蠻值得玩一下的,這台 MBP 2019 機器的規格是:

The machine I am using is MacBook Pro, 2019 (2.4 GHz Quad Core i5, 8GB, 256GB SSD, Big Sur 11.1)

第一版是 Python 寫的,塞 10M 筆花了 15 分鐘:

In this script, I tried to insert 10M rows, one by one, in a for loop. This version took close to 15 minutes, sparked my curiosity and made me explore further to reduce the time.

加了五個 PRAGMA 的版本變成 100M 筆十分鐘:

The naive for loop version took about 10 minutes to insert 100M rows.

用批次處理則可以降到八分半:

The batched version took about 8.5 minutes to insert 100M rows.

再來是拿經典神器 PyPy 出來用,降到兩分半:

All I had to do was run my existing code, without any change, using PyPy. It worked and the speed bump was phenomenal. The batched version took only 2.5 minutes to insert 100M rows. I got close to 3.5x speed :)

接下來就是跳槽到 Rust 了,中間也有不少 tuning 相關的討論,但直接先跳到最後面好了... 最後 100M 只用了 33 秒:

I created a threaded version, where I had one writer thread that received data from a channel and four other threads which pushed data to the channel. This is the current best version which took about 32.37 seconds.

能用 PyPy 的地方還是可以考慮一下的...

用 Python 的 DuckDB 下 SQL 指令翻 Parquet 的資料

在「Querying Parquet using DuckDB」這邊看到 DuckDB 這個東西,裡面引用的文章是「Querying Parquet with Precision using DuckDB」,可以直接對 Parquet 格式的資料下 SQL 找資料。

先前好像有看到 DuckDB 但沒有太注意,剛剛再次看到,然後玩了一下還蠻有趣的。DuckDB 支援蠻多程式語言與資料格式,不過這邊文章拿 Python 與 Parquet 玩還蠻有趣的...

先把 Parquet 的範例資料抓下來,然後透過 pip 裝 duckdb:

cd /tmp; wget https://github.com/cwida/duckdb-data/releases/download/v1.0/taxi_2019_04.parquet; pip install -U duckdb

然後進到 Python 3 的互動界面:

>>> import duckdb
>>> print(duckdb.query("SELECT COUNT(*) FROM 'taxi_2019_04.parquet' WHERE pickup_at BETWEEN '2019-04-15' AND '2019-04-20'").fetchall())
[(1276565,)]

然後在範例裡面,檔名的部份還可以用 *,看了一下說明,底層是 glob 類的用法:

DuckDB supports the globbing syntax, which allows it to query all three files simultaneously.

文章裡有提到速度比 Pandas 快很多,不過我覺得這好像不太能這樣比,會拿 Pandas 出來的時候常常是其他用法,但至少看起來速度是個 DuckDB 在意的點。

不過反而馬上想到的是,之後處理 CSV 之類的檔案應該也會試看看 DuckDB...

PostgreSQL 的 Fuzzy Matching

在「Fuzzy Name Matching in Postgres」這邊看到 PostgreSQL 下怎麼設計 Fuzzy Matching 的方式,文章裡用的方法主要是出自 PostgreSQL 的文件:「F.15. fuzzystrmatch」。

文章最後的解法是 Soundex + Levenshtein

翻了一下資料,這個領域另外有 NYSIIS (New York State Identification and Intelligence System):

The New York State Identification and Intelligence System Phonetic Code, commonly known as NYSIIS, is a phonetic algorithm devised in 1970 as part of the New York State Identification and Intelligence System (now a part of the New York State Division of Criminal Justice Services). It features an accuracy increase of 2.7% over the traditional Soundex algorithm.

以及 Metaphone

Metaphone is a phonetic algorithm, published by Lawrence Philips in 1990, for indexing words by their English pronunciation. It fundamentally improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding, which does a better job of matching words and names which sound similar. As with Soundex, similar-sounding words should share the same keys. Metaphone is available as a built-in operator in a number of systems.

不過這些都是以英文為主,中文的沒特別翻到...

用 GPT-3 直接產生對應的 SQL query

Hacker News Daily 上看到的東西,直接給 GPT-3 條件,叫 GPT-3 給出對應的 SQL query,這就有點誇張了...:「Automating My Job with GPT-3」。

這樣丟進去:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct.

Input: how many users signed up in the past month?

然後這樣出來:

GPT-3 Response: SELECT COUNT(*) FROM users
WHERE signup_time > now() - interval '1 month'

把欄位名稱替換掉就可以用了... 而且接下來就更誇張了,直接要求 GPT-3 也要照著設定的 table schema 給答案:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct. Only use tables called "users" and "charges". The "users" table has columns: id (integer), signup_dt (timestamp), email (character varying), and plan_type (character varying). The "charges" table has columns: amount (bigint), user_id (integer), and charge_dt (timestamp).

Input: how much revenue did we have in the past 7 days?

然後輸出了:

GPT-3 Response: SELECT SUM(amount) FROM charges WHERE charge_dt > now() - interval '7 days'

接下來是在同樣 instruction 下,跨表格的問題:

Input: how much revenue have we had from users that signed up in the last 6 months?

這時候 INNER JOIN 就跑出來了:

.8 Temperature GPT-3 Response: SELECT SUM(charges.amount) FROM users INNER JOIN charges ON users.id = charges.user_id WHERE signup_dt >= DATE_SUB(now(), INTERVAL '6 months')

後面的問題也很精彩,看起來之後可以接上 BI dashboard,直接丟句子進去,然後拉各種資料出來視覺化?

Eventbrite 的 MySQL 升級計畫

在 2021 年看到 EventbiteMySQL 升級計畫:「MySQL High Availability at Eventbrite」。

看起來是 2019 年年初的時候 MySQL 5.1 出問題,後續決定安排升級,在 2019 年年中把系統升級到 MySQL 5.7 (Percona Server 版本):

Our first major hurdle was to get current with our version of MySQL. In July, 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances.

然後看起來是直接在 EC2 上跑,不過這邊提到的空間問題就不太確定了,是真的把 EBS 的空間上限用完嗎?比較常使用的 gp2gp3 上限都是 16TB,不確定是不是真的用到接近爆掉了:

Not only was support for MySQL 5.1 at End-of-Life (more than 5 years ago) but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

另外在升級到 5.7 的時候,順便把本來是 INT 的 primary key 都換成 BIGINT

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value.

然後系統因為舊版的 Django 沒辦法配合 MySQL 5.7,得升級到 Django 1.6 (要注意 Django 1 系列的最新版是 1.11,看起來光是升級到 1.6 勉強會動就升不上去了?):

In parallel with the MySQL 5.7 upgrade we also Upgraded Django to 1.6 due a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older version of Python/Django running on MySQL 5.7

然後採用了 GitHub 家研發的 gh-ost 當作改變 schema 的工具:

In December 2019, the Eventbrite DBRE successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables.

看起來主要的原因是有遇到 pt-online-schema-change 的限制 (在「GitHub 發展出來的 ALTER TABLE 方式」這邊有提到):

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table which is a very expensive operation and can cause replication lag. Matter of fact, it had directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

另外一個引入的技術是 Orchestrator,看起來是先跟 HAProxy 搭配,不過他們打算要再換到 ProxySQL

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

然後最後題到了 Square 研發的 Shift,把 gh-ost 包裝起來變成有個 web UI 可以操作:

2021 還可以看到這類文章還蠻有趣的...

產生名次的 SQL

Percona 的「Generating Numeric Sequences in MySQL」這篇在討論產生字串序列,主要是在 MySQL 環境下,裡面看到的技巧「Session Variable Increment Within a SELECT」這組,剛好可以用在要在每個 row 裡面增加名次:

SELECT (@val := @val + 1) - 1 AS value FROM t1, (SELECT @val := 0) AS tt;

另外看到 MariaDBMySQL 8.0 系列因為有多支援各種功能,剛好也可以被拿來用,然後最後也提到了 Percona 自家出的 MySQL 8.0.20-11 將會直接有 SEQUENCE_TABLE() 可以用 (這應該才是 Percona 這篇文章的主要目的,推銷一下自家產品的新功能)。

文章收起來之後遇到可以拿出來參考用...

Amazon DocumentDB 推出相容 MongoDB 4.0 的版本

在「Amazon DocumentDB (with MongoDB compatibility) adds support for MongoDB 4.0 and transactions」這邊看到 AWSAmazon DocumentDB 上推出相容 MongoDB 4.0 的版本。

把年初在 Ptt 上寫的「Re: [請益] 選擇mongoDB或是relational database ??」這篇拿出來講一下,MongoDB 4.0 最大的改進就是 multi-document transactions 了。

不過 AWS 先前推出 DocumentDB (MongoDB) 時看到的限制,大家都猜測是用 PostgreSQL 當底層 (「AWS 推出 MongoDB 服務:Amazon DocumentDB」與「大家在猜 Amazon DocumentDB 的底層是不是 PostgreSQL...」),雖然目前還是不太清楚,但如果這個猜測屬實的話,要推出各種 transaction 的支援完全不是問題 XDDD

Percona 對 MongoDB 的建議

看到「5 Things DBAs Should Know Before Deploying MongoDB」這篇,裡面給了五個建議,其中第五點頗有趣:

5) Whenever Possible, Working Set < RAM

As with any database, fitting your data into RAM will allow for faster reads than from disk. MongoDB is no different. Knowing how much data MongoDB has to read in for your queries can help you determine how much RAM you should allocate to your database.

這樣的設計邏輯很奇怪啊,你不要扯其他 database 啊,你們家主力的 InnoDB 一直都沒有推薦要 Working Set < RAM 啊,反過來才是用 InnoDB 的常態吧,而且在 PostgreSQL 上也是這樣吧 XDDD

現在上面的文章真的是挑著看了... XD

EnterpriseDB 買下 2ndQuadrant

算是 PostgreSQL 社群裡面的大事情,看到大老在討論 EnterpriseDB (EDB) 買下 2ndQuadrant 的事情:「Community Impact of 2nd Quadrant Purchase」,這兩家公司都是 PostgreSQL 社群裡面重量級的台柱。

先翻了一下新聞稿,兩邊的官方新聞稿分別是「How EDB Became the Leader in the Postgres Market」與「How EDB Became the Leader in the Postgres Market」。

回到原來的文章,裡面提到了 core team 的不成文規定,這個部份可以從 Contributor Profiles 這邊看到目前 core team 有五位成員,Peter Eisentraut 來自 2ndQuadrant,而 Bruce Momjian (這是文章作者自己) 與 Dave Page 則是來自 EnterpriseDB:

First, there is an unwritten rule that the Postgres core team should not have over half of its members from a single company, and the acquisition causes edb's representation in the core team to be 60% — the core team is working on a solution for this.

裡面有提到目前正在找辦法解決中,但不知道目前會怎麼解決,讓出位置可能是一個方法,加到七個人應該也是個方法,反正方法不算少,就等著看...

另外他提出來的兩個問題我覺得都還好,就是併購本來就會發生的事情。

這次的併購算是 PostgreSQL 社群裡面蠻熱鬧的事情,雖然是商業公司之間的併購,但社群這邊應該也會有不少變化...

PostgreSQL 13 的 B-Tree Deduplication

Hacker News 上看到「Lessons Learned from Running Postgres 13: Better Performance, Monitoring & More」這篇文章,其中有提到 PostgreSQL 13 因為 B-Tree 支援 deduplication,所以有機會縮小不少空間。

搜了一下源頭是「Add deduplication to nbtree.」這個 git commit,而 PostgreSQL 官方的說明則是在「63.4.2. Deduplication」這邊可以看到。

另外值得一提的是,這個功能在 CREATE INDEX 這頁可以看到在 PostgreSQL 13 預設會打開使用。

依照說明,看起來本來的機制是當 B-Tree index 內的 key 相同時,像是 key1 = key2 = key3 這樣,他會存 {key1, ptr1}{key2, ptr2}{key3, ptr3}

在新的架構下開啟 deduplication 後就會變成類似 {key1, [ptr1, ptr2, ptr3]} 這樣的結構。可以看出來在 key 重複的資料很多的時候,可以省下大量空間 (以術語來說的話,就是 cardinality 偏低的時候)。

這樣看起來可以降低不少壓力...