把 RabbitMQ 換成 PostgreSQL 的那篇文章...

Hacker News 上看到「SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)」這篇文章,原文在「SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue」這邊,裡面在講他們把 RabbitMQ 換成 PostgreSQL 的前因後果。

文章裡面可以吐嘈的點其實蠻多的,而且在 Hacker News 上也有被點出來,像是有人就有提到他們遇到了 bug (或是 feature) 卻不解決 bug,而是決定直接改寫成用 PostgreSQL 來解決,其實很怪:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

另外一個吐嘈的點是量的部份,如果就這樣的量,用 PostgreSQL 降低使用的 tech stack 應該是個不錯的決定 (但另外一個問題就是,當初為什麼要導入 RabbitMQ...):

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

在同一個 thread 下面也有人提到這個量真的很小,甚至直接不講武德提到可以用 Jenkins 解 XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

然後意外看到 Simon Willison 提到了一個重點,就是 RabbitMQ 到現在還是不支援 ACID 等級的 job queuing (尤其是 Durability 的部份),也就是希望 MQ 系統回報成功收到的 task 一定會被處理:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

這也是當年為什麼用 MySQL 幹類似的事情,要 ACID 的特性來確保內容不會掉。

這也是目前我覺得唯一還需要用 RDBMS 當 queue backend 的地方,但原文公司的想法就很迷,遇到 library bug 後決定換架構,而不是想辦法解 bug,還很開心的寫一篇文章來宣傳...

Amazon DocumentDB 支援 MongoDB 5.0 協定

看到「Amazon DocumentDB (with MongoDB compatibility) adds support for MongoDB 5.0 wire protocol and client-side field level encryption」這篇,Amazon DocumentDB 推出了 MongoDB 5.0 的支援。

MongoDB 5.0 是 2021 年七月的版本,對使用者端比較重要的看起來是 PII 處理與 TSDB 的支援。

比較感興趣的是,TSDB 的部份是繼續用 PostgreSQL 魔改嗎...?

Etsy 使用 Vitess 的過程

Etsy 寫了三偏關於使用 Vitess 解決資料庫效能問題的文章:「Scaling Etsy Payments with Vitess: Part 1 – The Data Model」、「Scaling Etsy Payments with Vitess: Part 2 – The “Seamless” Migration」、「Scaling Etsy Payments with Vitess: Part 3 – Reducing Cutover Risk」。

Vitess 是 YouTube 團隊開發出來的東西,試著透過一層 proxy 解決後端 MySQL 資料庫在 sharding 後查詢邏輯的問題。


首先是現代暴力解的能耐,從維基百科可以查到 Etsy 在 2015 年就上市了,但到了 2020 年年底撞到 vertically scaling 的天花板 (這邊是指 GCP 的上限),可以看到現在的暴力法可以撐超久... 如果再多考慮到實體機房的話應該可以找到更大台的機器。

第二個是 Etsy 在 2020 年年底開始從資料庫搬資料,一路到 2022 年五月,算起來差不多搬了一年半,總共轉移了 4 個 database 到 Vitess 的 cluster 上,共 23 張表格與 40B rows。

第三個是利用 Vindexes 這個技術降低 sharding 時所帶來的限制。這個之前沒研究過:

A Vindex provides a way to map a column value to a keyspace ID.

從「Older Version Docs」這邊翻舊版的文件,發現 5.0+ 都有,再往 GitHub 上面的資料翻,看起來從 2016 年的版本就有了,不過當時看起來還一直在擴充:「Vitess v2.0.0-rc.1」。

回來看現在的功能,有 primary vindex 的設計:

The Primary Vindex for a table is analogous to a database primary key. Every sharded table must have one defined. A Primary Vindex must be unique: given an input value, it must produce a single keyspace ID.

然後是 secondary vindex(es) 的設計,指到 keyspace id(s),然後這個資訊會被用在 routing 上:

Secondary Vindexes are additional vindexes against other columns of a table offering optimizations for WHERE clauses that do not use the Primary Vindex. Secondary Vindexes return a single or a limited set of keyspace IDs which will allow VTGate to only target shards where the relevant data is present. In the absence of a Secondary Vindex, VTGate would have to send the query to all shards (called a scatter query).

It is important to note that Secondary Vindexes are only used for making routing decisions. The underlying database shards will most likely need traditional indexes on those same columns, to allow efficient retrieval from the table on the underlying MySQL instances.

然後是 functional vindex 與 lookup vindex,前者用演算法定義 keyspace id,後者讓你查:

A Functional Vindex is a vindex where the column value to keyspace ID mapping is pre-established, typically through an algorithmic function. In contrast, a Lookup Vindex is a vindex that provides the ability to create an association between a value and a keyspace ID, and recall it later when needed. Lookup Vindexes are sometimes also informally referred to as cross-shard indexes.

然後 lookup vindex 還有對 consistent hashing 的支援:

Consistent lookup vindexes use an alternate approach that makes use of careful locking and transaction sequences to guarantee consistency without using 2PC. This gives the best of both worlds, with the benefit of a consistent cross-shard vindex without paying the price of 2PC. To read more about what makes a consistent lookup vindex different from a standard lookup vindex read our consistent lookup vindexes design documentation.

這樣整體看起來,Vitess 把所有常見的 sharding 方式都包進去了,如果以後真的遇到這個量的話,也不需要自己在 application 或是 library 做一堆事情了...

Pony ORM

Simon Willison 的 blog 上看到的東西:「Python’s “Disappointing” Superpowers」,裡面提到的原文是「Python’s “Disappointing” Superpowers」這篇,在講 Python 的工具。

雖然是說「disappointing」,但實際上是反義,在原文裡面提到了很多特別的工具,其中 Pony ORM 算是我覺得最有趣的了,他的寫法就非常的 Python:

select(c for c in Customer if sum(c.orders.price) > 1000)

也可以用 lambda 的形式來寫:

Customer.select(lambda c: sum(c.orders.total_price) > 1000)

這樣會產生出對應的 SQL:

SELECT "c"."id"
FROM "customer" "c"
  LEFT JOIN "order" "order-1"
    ON "c"."id" = "order-1"."customer"
GROUP BY "c"."id"
HAVING coalesce(SUM("order-1"."total_price"), 0) > 1000

不會產生 syntax error 的原因是因為他直接解讀 bytecode 分析,產生出對應的 SQL query:

A normal understanding of generator expressions suggests that the select function is consuming a generator. But that couldn’t explain the behaviour here. Instead, it actually introspects the frame object of the calling code, then decompiles the byte code of the generator expression object it finds, and builds a Query based on the AST objects.


看了一下也有一些 integration,像是 Flask 的「Integration with flask」與 FastAPI 的「Integration with FastAPI」。

不過應該是先看看,目前 Python 上用的主力還是 Django,有自己的 ORM 架構...

SQLite 的 HC-tree 計畫

Hacker News 首頁上看到的新計畫:「HC-tree is an experimental high-concurrency database back end for SQLite (sqlite.org)」,SQLite 弄了一個實驗性質的 backend,叫做 HC-tree

The HC-tree (hctree) project is an attempt to develop a new database backend that improves upon regular SQLite as follows:

他列了幾個重點,其中「Improved concurrency」這點題到了可以讓多個 writer 同時寫入運作,這點算是 SQLite 很大的改變,目前希望可以做到在 single-threaded 情況下不輸現有的 SQLite:

An implicit goal is that hctree must be as fast or faster than stock SQLite for all single-threaded cases. There is no point in running dozens of concurrent writers if each of them is an order of magnitude slower than a single writer writing to a legacy database.

另外一方面,這算是 SQLite 真正要面對資料庫的 isolation 的問題了,比起現在的版本,同時間從只有一個 writer 的架構要變成支援多個 writer 的架構,所以在「Concurrency Model」這邊也帶了一下他預期可以做到的事情。

然後這邊可以看到在解釋裡面有提到 table 與 index 還是 b-tree,這樣應該可以猜測 hctree 的實做方式應該還是在市場上已經很成熟的 MVCC 那套方法:

If no other client has modified any b-tree (table or index) entry or range that the transaction being committed accessed by a range or stabbing query, then the transaction is valid.

另外一個蠻大的改變是「Support for replication」,現有的 SQLite 可以透過 extension 的方式加掛支援 replication 功能,現在則是讓底層的 backend 直接支援。

底層支援新的 backend 以後看起來會有不少變化可以玩,第一個想到的當然是變成 server 類型的服務,也就是像 MySQL 或是 PostgreSQL 這樣的方式,另外一種方向是包裝成 distributed database,讓應用程式可以簡單跨機器使用。

目前還不知道會往什麼方向走就是了,也有可能 SQLite 這邊只實做 backend,上面讓大家發揮...

資料庫在 2022 年的發展

Hacker News 上看到這篇對 2022 年資料庫發展的回顧文章,可以補一些沒看到的新聞與發展:「Databases in 2022: A Year in Review」,作者 Andy PavloOtterTune 的創辦人,有些他的想法會有些偏見,就當作一方之言來看就好。另外在 Hacker News 上的討論則是冒出一堆創辦人出來替自己公司的產品介紹一番:「Databases in 2022: A Year in Review (ottertune.com)」。

首先是他提到的 ClickHouse,我是 2022 年才開始關注,而且發現很不賴。查了一下維基百科,在 2021 年十月的時候搞了 round B $250m,估值 $2b:

On October 28, 2021 the company received Series B funding totaling $250 million at an valuation of $2 billion from Coatue Management, Altimeter Capital, and other investors.

另外是 Meta 弄出來的 Velox,直接看官網上面的圖可以看到,他試著把 Database、PrestoSpark 的 engine/worker 層抽換掉:

GitHub 的頁面說明上也可以看出來,Velox 是提供很多已經最佳化過的界面 (包括了 Type、Vector、Expression Eval、Function Packages、Operators、I/O、Network Serializers 與 Resource Management) 讓 engine/worker 直接使用,避免了自己的實做沒有最佳化:

Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components.

In common usage scenarios, Velox takes a fully optimized query plan as input and performs the described computation. Considering Velox does not provide a SQL parser, a dataframe layer, or a query optimizer, it is usually not meant to be used directly by end-users; rather, it is mostly used by developers integrating and optimizing their compute engines.


WordPress 打算要支援 SQLite 作為後端資料庫

目前 WordPress 只有支援 MySQL,而昨天在 Hacker News 上看到 WordPress 有打算要支援 SQLite 作為後端資料庫的消息:「WordPress testing official SQLite Support (github.com/wordpress)」,原文在 GitHub 上:「Implement new experimental SQLite integration module」。

理論上對使用者會更方便,但對 extension 開發者會麻煩一些 (或是直接標不支援?),尤其是用到 MySQL 特有的語法就要注意了。

實質上 PHP + MySQL hosting 其實蠻常見的,這個作法有多少幫助就不知道了。

但突然想到,如果做一個 read-only 版本的 WordPress 站台,然後把 SQLite 的讀取部份改用 sql.js 之類的計畫,再把一堆 server side rendering 的部份變成 client side rendering,好像有機會可以整包直接上 GitHub Pages 之類的服務?雖然這樣有點拖褲子放屁...

SQLite 官方自己下來搞 WASM/JS 計畫

先前在「把 SQLite 的 VFS 掛上 WebTorrent 的 PoC Demo」有提過 sql.js 這個專案,把 SQLite 移植到網頁上,這些都算是非官方的社群弄出來的專案。

現在官方直接跳下來玩,宣佈自己也要搞 WASM/JS 了:「sqlite3 wasm docs: About the sqlite3 WASM/JS Subproject」。

Folks have been building sqlite3 for the web since as far back as 2012 but this subproject is the first effort "officially" associated with the SQLite project, created with the goal of making WASM builds of the library first-class members of the family of supported SQLite deliverables.

但不太確定 D. Richard Hipp 的想法,官方支援 WASM/JS 的目的會是什麼?放給社群繼續發展有什麼問題嗎...

PostgreSQL 15 釋出

PostgreSQL 15 出了:「PostgreSQL 15 Released!

先前提到過「PostgreSQL 15 將可以對透過 UNIQUE 限制 NULL 的唯一性了」,反而沒排上這次 release 的重點,翻了一下的確是排不太上 XD

第一個超大的改善是 sorting:

In this latest release, PostgreSQL improves on its in-memory and on-disk sorting algorithms, with benchmarks showing speedups of 25% - 400% based on which data types are sorted.

在「Speeding up sort performance in Postgres 15」這邊有提到四個改動,裡面很詳細的說明了改動的內容,以及 benchmark 差異。

如果以他列出來的四個進展,應該是第二個「Reduce memory consumption by using generation memory context」這個會最容易遇到,也改善最多:

另外是第三個「Add specialized sort routines for common datatypes」也會有一些:

再來是拿 PostgreSQL 當 OLAP engine 用的時候會發生的第四個「Replace polyphase merge algorithm with k-way merge」:

最開頭第一個「Improvements sorting a single column」的 SELECT col1 FROM tab ORDER BY col1; 這種 case 好像用的很少,限制 SELECT 的部份也只能出現後面 sorting 的 column,但如果遇到的話效能提昇很多:

除了 sorting 的改善以外,另外一個是 WAL 支援 LZ4zstd,這對於有寫入量很大的環境應該會有幫助:

PostgreSQL 15 adds support for LZ4 and Zstandard (zstd) compression to write-ahead log (WAL) files, which can have both space and performance benefits for certain workloads.

正式版出來後,應該會有一些整體性的 benchmark 數字可以看,再來等著看...

Apple 使用 Cassandra 的量

Hacker News Daily 上看到的:「Cassandra at Apple: 1000s of Clusters, 300k Nodes, 100 PB (twitter.com/erickramirezau)。原文在 Twitter 上:

有些數字有點對不太起來,裡面提到 300K nodes + millions of QPS,但通常讀寫都算 QPS,這樣聽起來很少?所以有種可能這邊是只有算 read 的部份...

另外照片裡面提到 Over two petabytes per cluster,但有 thousands of clusters,最後卻只有 Hundreds of petabytes of data,完全對不上,就算當作平均值來算也對不上,只能猜測是最大的 cluster 而不是 per cluster。

裡面矛盾的地方太多,所以這些數字基本上沒有參考價值,現在能讀出來的只知道 Apple 有在用 Cassandra,然後不是少少幾台 PoC 等級的使用。