Jepsen circles back to test MySQL 8.0

Saw the author's own submission on Hacker News: "Jepsen: MySQL 8.0.34 (jepsen.io)"; the original report is "MySQL 8.0.34".

This time the test wasn't something Oracle paid Jepsen to run; Jepsen went back and tested MySQL 8.0 on their own:

This work was performed independently without compensation, and conducted in accordance with the Jepsen ethics policy.

Then, as an unexpected stray bullet (or grenade?), AWS RDS got hit as well: the tests show that RDS in cluster mode cannot provide Serializability:

As a lagniappe, we show that AWS RDS MySQL clusters routinely violate Serializability.

As for MySQL itself, the report finds problems with REPEATABLE-READ (the default isolation level):

Using our transaction consistency checker Elle, we show that MySQL Repeatable Read also violates internal consistency. Furthermore, it violates Monotonic Atomic View: transactions can observe some of another transaction’s effects, then later fail to observe other effects of that same transaction. We demonstrate violations of ANSI SQL’s requirements for Repeatable Read.
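
To make the kind of anomaly concrete (this is not the Jepsen workload, just a minimal illustration of the same family of internal-consistency issues): under InnoDB's REPEATABLE-READ, plain SELECTs read from the transaction's snapshot while locking reads see the latest committed data, so a single transaction can observe another transaction's effects through one read path but not the other. A hedged two-session sketch, assuming mysql-connector-python and a made-up `kv` table:

```python
# Sketch only: assumes a local MySQL with a table `kv (k INT PRIMARY KEY, v INT)`
# seeded with one row (k=1); connection parameters are placeholders.
import mysql.connector

def connect():
    return mysql.connector.connect(user="root", password="", database="test")

t1, t2 = connect(), connect()
c1, c2 = t1.cursor(), t2.cursor()

c1.execute("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
c1.execute("START TRANSACTION")
c1.execute("SELECT v FROM kv WHERE k = 1")             # establishes T1's snapshot
print("T1 snapshot read:", c1.fetchall())

c2.execute("UPDATE kv SET v = v + 1 WHERE k = 1")      # T2 commits a change
t2.commit()

c1.execute("SELECT v FROM kv WHERE k = 1")             # plain read: still the old value
print("T1 plain read:   ", c1.fetchall())
c1.execute("SELECT v FROM kv WHERE k = 1 FOR UPDATE")  # locking read: sees T2's commit
print("T1 locking read: ", c1.fetchall())
t1.rollback()
```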

A large chunk at the front of the report is history, explaining how poorly the original ANSI SQL standard defined isolation levels, which led to many different interpretations, and how even SQL:2023 still hasn't improved on this.

It then covers how the isolation levels various databases claim to offer differ from the ANSI definitions as well... (under any interpretation of the ANSI definitions).

In the middle, though, it mentions that in 1999 Atul Adya tried to formally define isolation levels, redefining the original four in a more rigorous yet compatible way; this looks like the alternative the author recommends for general use:

In 1999, Atul Adya built on Berenson et al.’s critique and developed formal and implementation-independent definitions of various transaction isolation levels, including those in ANSI SQL. As he notes[.]

The four are PL-1 corresponding to READ UNCOMMITTED, PL-2 to READ COMMITTED, PL-2.99 to REPEATABLE-READ, and PL-3 to SERIALIZABILITY; PL-2.99 and REPEATABLE-READ in particular show up repeatedly later in the report.

The surprising part this time is that problems were found on a single machine; the RDS findings are less surprising, since AWS is known to do quite a few hacks underneath, so there were bound to be some trade-offs?

That article about swapping RabbitMQ for PostgreSQL...

Saw "SQL Maxis: Why We Ditched RabbitMQ and Replaced It with a Postgres Queue (prequel.co)" on Hacker News; the original post is "SQL Maxis: Why We Ditched RabbitMQ And Replaced It With A Postgres Queue", which walks through why and how they replaced RabbitMQ with PostgreSQL.

There are quite a few things to nitpick in the article, and they were pointed out on Hacker News as well. For instance, someone noted how strange it is that they hit a bug (or a feature?) and, instead of fixing the bug, decided to rewrite the whole thing on top of PostgreSQL:

In summary -- their RabbitMQ consumer library and config is broken in that their consumers are fetching additional messages when they shouldn't. I've never seen this in years of dealing with RabbitMQ. This caused a cascading failure in that consumers were unable to grab messages, rightfully, when only one of the messages was manually ack'ed. Fixing this one fetch issue with their consumer would have fixed the entire problem. Switching to pg probably caused them to rewrite their message fetching code, which probably fixed the underlying issue.

Another nitpick is the volume: at this scale, using PostgreSQL to shrink the tech stack is probably a good decision (though then the question becomes why they introduced RabbitMQ in the first place...):

>To make all of this run smoothly, we enqueue and dequeue thousands of jobs every day.

If you your needs aren't that expensive, and you don't anticipate growing a ton, then it's probably a smart technical decision to minimize your operational stack. Assuming 10k/jobs a day, thats roughly 7 jobs per minute. Even the most unoptimized database should be able to handle this.

Further down the same thread someone also pointed out that this volume is really tiny, and even, pulling no punches, suggested it could be handled with Jenkins XD:

Years of being bullshitted have taught me to instantly distrust anyone who is telling me about how many things they do per day. Jobs or customers per day is something to tell you banker, or investors. For tech people it’s per second, per minute, maybe per hour, or self aggrandizement.

A million requests a day sounds really impressive, but it’s 12req/s which is not a lot. I had a project that needed 100 req/s ages ago. That was considered a reasonably complex problem but not world class, and only because C10k was an open problem. Now you could do that with a single 8xlarge. You don’t even need a cluster.

10k tasks a day is 7 per minute. You could do that with Jenkins.
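
Just to sanity-check the arithmetic in those comments, a quick back-of-the-envelope conversion:

```python
# 10k jobs/day and 1M requests/day expressed as per-minute / per-second rates
print(10_000 / (24 * 60))       # ~6.9 jobs per minute
print(1_000_000 / (24 * 3600))  # ~11.6 requests per second
```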

Then, unexpectedly, Simon Willison brought up a key point: RabbitMQ still doesn't offer ACID-grade job queuing (especially the Durability part), i.e. a guarantee that a task the MQ reports as successfully received will definitely be processed:

The best thing about using PostgreSQL for a queue is that you can benefit from transactions: only queue a job if the related data is 100% guaranteed to have been written to the database, in such a way that it's not possible for the queue entry not to be written.

Brandur wrote a great piece about a related pattern here: https://brandur.org/job-drain

He recommends using a transactional "staging" queue in your database which is then written out to your actual queue by a separate process.
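
A minimal sketch of that transactional-enqueue idea, assuming psycopg2 and made-up `orders` / `jobs` tables: the queue row only becomes visible if the data write commits, so the queue can never point at data that isn't there.

```python
# Sketch only: table names, columns, and the DSN are hypothetical.
import json

import psycopg2

conn = psycopg2.connect("dbname=app")

with conn:  # one transaction: both rows commit together, or neither does
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (customer_id, total) VALUES (%s, %s) RETURNING id",
            (42, 99.95),
        )
        order_id = cur.fetchone()[0]
        # The "staging" queue entry rides along in the same transaction.
        cur.execute(
            "INSERT INTO jobs (kind, payload) VALUES (%s, %s)",
            ("send_receipt", json.dumps({"order_id": order_id})),
        )

# A separate worker (or the drain process Brandur describes) later pulls rows
# from `jobs`, e.g. with SELECT ... FOR UPDATE SKIP LOCKED.
```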

This is also why, back in the day, we used MySQL for this kind of thing: the ACID properties make sure nothing gets lost.

And it's the only case where I still think you need an RDBMS as a queue backend today. The company's reasoning in the original post is still baffling, though: they hit a library bug and decided to change the architecture rather than try to fix the bug, and then happily wrote an article to promote it...

MongoDB's deceptive advertising

Jepsen recently published a new report testing the latest MongoDB 4.2.6, and the tone is noticeably harsher than before. Digging into the backstory, it looks like the trigger was a tweet on Twitter pointing at the page where MongoDB uses Jepsen for marketing.

The official Jepsen account then replied, in disbelief.

Two weeks later, Jepsen published "Jepsen: MongoDB 4.2.6", written by the boss himself, Kyle Kingsbury, testing the latest MongoDB 4.2.6.

The report calls out a lot of unethical behavior. First, the earlier tests had found plenty of issues that can lose data, but MongoDB's own marketing page, "MongoDB and Jepsen", mentions none of them and instead claims the strongest data consistency and correctness in the industry (which doesn't match what the Jepsen reports show). Jepsen therefore suggests listing these findings on that page so that users aren't "misled":

Curiously, MongoDB omitted any mention of these findings in their MongoDB and Jepsen page. Instead, that page discusses only passing results, makes no mention of read or write concern, buries the actual report in a footnote, and goes on to claim:

MongoDB offers among the strongest data consistency, correctness, and safety guarantees of any database available today.

We encourage MongoDB to report Jepsen findings in context: while MongoDB did appear to offer per-document linearizability and causal consistency with the strongest settings, it also failed to offer those properties in most configurations. We think users might want to be aware that their database could lose data by default, but MongoDB’s summary of our work omits any mention of this behavior.

And then, of course, there's the retest of MongoDB 4.2.6 itself. If you don't have time to read the whole thing, just skim the section headings; they already give away quite a bit:

3 Results
3.1 Sometimes, Programs That Use Transactions… Are Worse
3.2 How ACID is Snapshot Isolation, Anyway
3.3 Indeterminate Errors
3.4 Duplicate Effects
3.5 Read Skew
3.6 Cyclic Information Flow
3.7 Read Your (Future) Writes

The Discussion section at the end is clearer, though.

First, it takes issue with calling snapshot isolation "full ACID":

MongoDB 4.2.6 claims to offer “full ACID transactions” via snapshot isolation. However, the use of these transactions is complicated by weak defaults, confusing APIs, and undocumented error codes. Snapshot isolation is questionably compatible with the marketing phrase “full ACID”. Even at the highest levels of read and write concern, MongoDB’s transaction mechanism exhibited various anomalies which violate snapshot isolation.

Snapshot isolation is a reasonably strong consistency model, but claiming that snapshot isolation is “full ACID” is questionable.

And even with every data-safety-related setting cranked to the maximum, it simply cannot deliver the snapshot isolation it claims:

Finally, even with the strongest levels of read and write concern for both single-document and transactional operations, we observed cases of G-single (read skew), G1c (cyclic information flow), duplicated writes, and a sort of retrocausal internal consistency anomaly: within a single transaction, reads could observe that transaction’s own writes from the future. MongoDB appears to allow transactions to both observe and not observe prior transactions, and to observe one another’s writes. A single write could be applied multiple times, suggesting an error in MongoDB’s automatic retry mechanism. All of these behaviors are incompatible with MongoDB’s claims of snapshot isolation.

Along the way they also found that even with the snapshot level configured, MongoDB's reads still don't honor snapshot isolation:

MongoDB’s default read and write concern for single-document operations remains local, which can observe uncommitted data, and w: 1, which can lose committed writes. Even when users select safer settings in their clients at the database or collection level, transactions ignore these settings and default again to local and w: 1. The snapshot read concern does not actually guarantee snapshot isolation, and must always be used in conjunction with write concern majority. This holds even for transactions which perform no writes.
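
For reference, this is roughly what the "strongest settings" combination looks like in a pymongo sketch (the collection and the operations are hypothetical); the point is that the snapshot read concern and the majority write concern have to be set on the transaction itself, because transactions ignore the database/collection-level defaults:

```python
# Sketch only: assumes a replica set at localhost and a `bank.accounts` collection.
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client.bank.accounts

def transfer(session):
    # Both updates run inside the transaction tied to this session.
    accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -10}}, session=session)
    accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 10}}, session=session)

with client.start_session() as session:
    session.with_transaction(
        transfer,
        read_concern=ReadConcern("snapshot"),
        write_concern=WriteConcern("majority"),
    )
```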

And none of the official documentation shows how to set up snapshot isolation; you only have a chance of finding that in third-party material:

Nor can users rely on examples to demonstrate snapshot isolated behavior. MongoDB’s transaction documentation and tutorial blog posts show only write-only transactions, using read concern local rather than snapshot. Other examples from MongoDB don’t specify a read concern or run entirely with defaults. Learn MongoDB The Hard Way uses read concern snapshot but write concern local, despite performing writes. Tutorials from DZone, Several Nines, Percona, The Code Barbarian, and Spring.io all claim that transactions are either ACID or offer snapshot isolation, but none set either read or write concern. There are some examples of MongoDB transactions which are snapshot isolated—for instance, from BMC, +N Consulting, and Maciej Zgadzaj, but most uses of MongoDB transactions we found ran—either intentionally or inadvertently—with settings that would (in general) allow write loss and aborted reads.

Basically, the boss got provoked and dropped a bomb, and judging from the tone there's still a lot left untested, so another one may be on the way?

PostgreSQL wrestling with fsync() behavior...

A talk from FOSDEM 2019 discussing how PostgreSQL, while ensuring the Durability part of ACID, ran into fsync() behaving differently from what it expected (mainly the behavior when fsync() fails): "PostgreSQL vs. fsync".

The Q&A-style interview at "PostgreSQL vs. fsync. How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it." gives a quick description of the short-term plan and the longer-term thinking:

The short-term solution is ensuring that we detect fsync errors reliably at least on sufficiently recent kernels (since 4.13). On older kernels we can’t do much better, unfortunately.

The long-term solution is still being discussed in the community, but it’s hard to say how we could keep relying on buffered I/O in the future. So we may end up with direct I/O, but that’s a pretty significant change and is likely going to be a multi-year project.
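
To make the issue concrete, a small hedged sketch (not PostgreSQL code) of why "just retry fsync()" is not a fix: once fsync() has reported an error, Linux may already have marked the dirty pages clean, so a later fsync() that returns success proves nothing, and the only safe reaction is to treat the first failure as fatal and recover from the WAL, which is roughly where PostgreSQL ended up.

```python
# Illustration only.
import os
import sys

def fsync_or_panic(fd: int) -> None:
    """Treat the first fsync() failure as fatal instead of retrying.

    After fsync() reports an error, the kernel may have dropped or cleaned
    the dirty pages, so retrying and getting a "success" does not mean the
    earlier writes ever reached disk.
    """
    try:
        os.fsync(fd)
    except OSError as err:
        # Retrying here would silently accept possible data loss.
        sys.exit(f"PANIC: fsync failed ({err}); crash and recover from WAL")
```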

MySQL, on the other hand, mostly lives in an O_DIRECT world, so it is affected far less...

Badger, written in Go

Dgraph is promoting Badger, the key-value store they developed themselves: "Introducing Badger: A fast key-value store written natively in Go".

The target is RocksDB, and the claim is that it's several times faster than RocksDB:

Based on benchmarks, Badger is at least 3.5x faster than RocksDB when doing random reads. For value sizes between 128B to 16KB, data loading is 0.86x - 14x faster compared to RocksDB, with Badger gaining significant ground as value size increases. On the flip side, Badger is currently slower for range key-value iteration, but that has a lot of room for optimization.

That said, some important features simply aren't provided in Badger, which makes the comparison feel a bit apples-to-oranges... for example, RocksDB offers transactions, while Badger states outright that they don't plan to support them:

Keep it simple, stupid. No support for transactions, versioning or snapshots -- anything that can be done outside of the store should be done outside.

Google's Cloud Spanner

Google has put Cloud Spanner up for sale as a service: "Introducing Cloud Spanner: a global database service for mission-critical applications", along with the explanatory "Inside Cloud Spanner and the CAP Theorem".

Cloud Spanner is designed to offer RDBMS capabilities (like the ACID properties) together with strong scalability and availability:

Today, we’re excited to announce the public beta for Cloud Spanner, a globally distributed relational database service that lets customers have their cake and eat it too: ACID transactions and SQL semantics, without giving up horizontal scaling and high availability.

The explanation mentions that, in CAP theorem terms, Cloud Spanner is a CP system:

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during some partitions, Spanner chooses C and forfeits A. It is technically a CP system.

And then they push A high enough that users don't really care about the downtime:

However, no system provides 100% availability, so the pragmatic question is whether or not Spanner delivers availability that is so high that most users don't worry about its outages.

Of course, the more controversial part is the tweet from the official Google Cloud account on Twitter, which flatly claimed to satisfy all three CAP properties at once.

The pricing isn't exactly cheap, but at least people shopping for a solution now have another option...

An introduction to Berkeley DB

In an era flooded with NoSQL, I was surprised to run into an introduction to Berkeley DB at "Berkeley DB: Architecture"...

In 2006, Sleepycat, the company behind Berkeley DB, was acquired by Oracle. After the acquisition, Oracle changed the open source licensing, moving from the earlier Sleepycat License to AGPLv3.

Berkeley DB was an early database library with a fairly complete feature set: page-level locking, crash safety, plus transactions. It was even used by MySQL as a storage engine at one point, though it was removed in MySQL 5.1: "14.5 The BDB (BerkeleyDB) Storage Engine".

The article covers a lot of the underlying design thinking (rather than merely describing "what was built"), discussing four aspects, Buffer, Lock, Log, and Transaction, all framed around the ACID requirements.

Call it a nostalgic piece of archaeology? Google's LevelDB and RocksDB, which Facebook went on to improve, have headed in rather different directions, and with NoSQL so prevalent the ACID requirements are being re-examined all over again...

On whether or not to use MySQL...

A while back, ant released the slides for the talk at 5xRuby: "技術演講:淺入淺出 MySQL & PostgreSQL", and there was also a discussion on kaif.io: "淺入淺出 MySQL & PostgreSQL // Speaker Deck".

Meanwhile, several articles abroad happened to be discussing MySQL (InnoDB); among them, "how innodb lost its advantage" is pessimistic about InnoDB's compression...

And the timing of Pinterest's "Learn to stop using shiny new things and love MySQL" feels like a response to some of the ideas above.

Below is my summary (and my take).

MySQL and PostgreSQL are both very mature RDBMSes.

If you have experience with one of them, use the RDBMS you're familiar with. If you have experience with both, choose based on your own judgment.

No experience with either? Pick whatever the people around you are using.

My answer at 5xRuby was on the casual side, but it's a very practical one: since you won't be using any of the advanced features anyway, the two are pretty much the same for you. Pick the one you can actually get answers for.

And when you really do get big enough, throwing money at it will always buy you a solution. Early on, put your energy into getting the product right; if you're not familiar with databases, this usually isn't the problem you should be spending time researching at that stage.