把 RabbitMQ 換成 PostgreSQL 的那篇文章...

Hacker News 上看到「SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)」這篇文章,原文在「SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue」這邊,裡面在講他們把 RabbitMQ 換成 PostgreSQL 的前因後果。

文章裡面可以吐嘈的點其實蠻多的,而且在 Hacker News 上也有被點出來,像是有人就有提到他們遇到了 bug (或是 feature) 卻不解決 bug,而是決定直接改寫成用 PostgreSQL 來解決,其實很怪:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

另外一個吐嘈的點是量的部份,如果就這樣的量,用 PostgreSQL 降低使用的 tech stack 應該是個不錯的決定 (但另外一個問題就是,當初為什麼要導入 RabbitMQ...):

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

在同一個 thread 下面也有人提到這個量真的很小,甚至直接不講武德提到可以用 Jenkins 解 XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.

然後意外看到 Simon Willison 提到了一個重點,就是 RabbitMQ 到現在還是不支援 ACID 等級的 job queuing (尤其是 Durability 的部份),也就是希望 MQ 系統回報成功收到的 task 一定會被處理:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.

這也是當年為什麼用 MySQL 幹類似的事情,要 ACID 的特性來確保內容不會掉。

這也是目前我覺得唯一還需要用 RDBMS 當 queue backend 的地方,但原文公司的想法就很迷,遇到 library bug 後決定換架構,而不是想辦法解 bug,還很開心的寫一篇文章來宣傳...

RabbitMQ 也進來搶 Streaming Engine 這塊市場了

RabbitMQ 要在 3.9 版推出 Streams (目前 3.9 還在 RC):「RabbitMQ Streams Overview」,在 Hacker News 上有一些討論可以看:「RabbitMQ Streams Overview (rabbitmq.com)」。

這樣 RabbitMQ 可以同時有 queue 與 stream 的能力,對於一些小專案應該會方便不少,算是打隔壁棚 Kafka 的痛點之一,弄一組 HA cluster 至少要三台 ZooKeeper 加上兩台 Broker,基礎建設還要有對應的 HA load balancer 機制,在 AWS 上可以拿 ELB 偷懶,如果是實體機房的話也許用 F5 擋,或是用 open source 的 Keepalived (或是老一點的 Heartbeat) + nginx

目前還不知道 scalability 的能耐,不過印象中在 queue 的單台 throughput 不差,如果沒有意外的話,streaming 的部份對小型專案應該夠用。

Amazon MQ 支援 RabbitMQ

Amazon MQ 本來只支援 ActiveMQ,剛剛看到消息支援 RabbitMQ 了:「Announcing Amazon MQ for RabbitMQ」。

用的版本還算可以,查了一下 3.8.6 是今年八月出的:

Amazon MQ currently supports RabbitMQ version 3.8.6 and has support for version upgrades.

基本上 ActiveMQ 版本與 RabbitMQ 版本的價錢相同,不過 RabbitMQ 看起來就只支援比較新的機器了,像是 t2 系列的機器基本上就不支援了。


對各類 Message Queue 的效能測試

在「Benchmarking Message Queue Latency」這篇看到作者測了一輪 Message Queue 軟體:

RabbitMQ (3.6.0), Kafka ( and, Redis (2.8.4) pub/sub, and NATS (0.7.3)

測試包括了從一個 9 到六個 9 的 latency (i.e. 90%、99%、99.9%、99.99%、99.999%、99.9999%),另外也測了 message 大小帶來的效能差異。

99.9% 表示 1/1000,而 99.99% 表示 1/10000,如果差距跟 90% 很大,表示系統反應時間會很不一致。另外有些 Message Queue 軟體有 disk persistence 的功能,也因為寫入資料,會看到更大的差距。

善用或是避開這些特性去規劃才能減少問題,像是關掉 disk persistence 之類的方法。