MySQL GTID Replication 的惡搞修復

Percona 的「Database Daily Ops Series: GTID Replication」這篇在講當 MySQL 的 GTID Replication 爛掉時可能的修法,算是頗惡搞的方法,修好後還是要跑 pt-table-checksum 確認兩邊的資料是否一致,如果有狀況的話還是得拿出 pt-table-sync 同步。

第一招是用 pt-slave-restart,跳過會造成問題 SQL,讓他強制同步 (唔):

This passes the master’s UUID and it skips all global transactions breaking replication on a specific slave server[.]

第二招是 mysqlslavetrx,也是類似的作法,只是拿的是 MySQL 官方的工具來惡搞...

第三招是 Inject a Fake Transaction,其實就是手動自己做 XDDD

所以不管是哪招,做完後還是要記得跑 pt-table-{checksum,sync} 收尾,不然還是會爛掉...

MyRocks 與 InnoDB 對於不同硬碟效能的差異

在「MyRocks: use less IO on writes to have more IO for reads」這邊有提到當使用 Disk Array、Slow SSD 與 Fast SSD 時效能 MyRocks 與 InnoDB 的效能差異。

可以看到當 I/O 層能提供的 IOPS capacity 愈高,MyRocks 與 InnoDB 之間的差異就愈低... 換句話說,RocksDB 對 IOPS 比較有效率的應用,這點在「Why is MyRocks more write-efficient than InnoDB?」這邊也可以看出一些說明 (不過這篇的 InnoDB 完全沒用 COMPRESSED)。

算是推銷... 不過可以持續關注看看 :o

InnoDB 的 Isolation Level 以及 Performance Schema 對效能的影響

雖然 Mark Callaghan 現在的主力都在 MyRocks 上,但他還是對 InnoDB 上的效能頗關注 (畢竟是個成熟而且競爭的產品)。而這篇「Sysbench, InnoDB, transaction isolation and the performance schema」講到 MySQL 5.6.26 裡的 InnoDB,了解 isolation level 與 performance schema 對效能的差異。結果可以在這邊翻到。

關掉 performance schema 會讓效能變好是預期的,不過看起來比預期小很多。另外某些情況下 RR (REPEATABLE-READ) 的效能會比 RC (READ-COMMITTED) 好倒是頗意外,這邊也有給出原因:

Using repeatable-read boosts performance because it reduces the mutex contention from getting a consistent read snapshot as that is done once per transaction rather than once per statement.

不過看了看數據,純粹讀取的部份 RC 會在某些地方快一些,不過整體來說在 MySQL 5.6.26 上的 RR 與 RC 差異真的不算太明顯了...

Facebook 備份 MySQL 資料並且確認正確性的方法

Facebook 再多花了一些篇幅數對於 MySQL 資料備份以及確認正確性的方法:「Continuous MySQL backup validation: Restoring backups」。

首先是 Continuous Restore Tier (CRT) 這塊,可以看到他們在這塊很仰賴 HDFS 當作備份的第一層基地,包括了 Full logical backups (用 mysqldump)、Differential (diff) backups 以及 Binary log (binlog) backups (stream 進 HDFS)。

另外上了 GTID,對於後續的處理會比較方便:

All of our database servers also use global transaction IDs (GTIDs), which gives us another layer of control when replaying transactions from binlog backups.

在 CRT 這塊可以看到其實是拿現成的工具堆起來,不同單位會因為規模而有不同的作法。真正的重點反而在 ORC Restore Coordinator (ORC) 這塊,可以看到 Facebook 開發了大量的程式將回復這件事情自動化處理:

在收到回復的需求後,可以看到 Peon 會從 HDFS 拉資料出來,並且用 binlog replay 回去:

Peons contain all relevant logic for retrieving backups from HDFS, loading them into their local MySQL instance, and rolling them forward to a certain point in time by replaying binlogs. Each restore job a peon works on goes through these five stages[.]

也是因為 Facebook 對 MySQL 的用量大到需要自動化這些事情,才有這些東西...

PostgreSQL 對 Vacuum 效能的改善

在「No More Full-Table Vacuums」這邊提到了 PostgreSQL 在 vacuum 時效能的大幅改善,尤其是大型資料庫在 vacuum 時需要對整個表格從頭到尾掃一次以確保 transaction id 的正確性:

Current releases of PostgreSQL need to read every page in the database at least once every 2 billion write transactions (less, with default settings) to verify that there are no old transaction IDs on that page which require "freezing".

這動作在資料量大的機器上就會吃大量資源導致各種討厭的現象:

All of a sudden, when the number of transaction IDs that have been consumed crosses some threshold, autovacuum begins processing one or more tables, reading every page. This consumes much more I/O bandwidth, and exerts much more cache pressure on the system, than a standard vacuum, which reads only recently-modified page.

而作者送了 patch 改成只會讀還沒搞定的部份:

Instead of whole-table vacuums, we now have aggressive vacuums, which will read every page in the table that isn't already known to be entirely frozen.

要注意的是,agreesive vacuum 相較於 vacuum 會多吃很多資源,但可以打散掉 (有點像一次大 GC 導致 lag 與多次 minor GC 讓程式反應時間變得比較順暢的比較):

An aggressive vacuum still figures to read more data than a regular vacuum, possibly a lot more. But at least it won't read the data that hasn't been touched since the last aggressive vacuum, and that's a big improvement.

這個功能預定在 PostgreSQL 9.6 出現,不知道會不會變 default...

PostgreSQL 9.5 釋出,UPSERT!

PostgreSQL 9.5 正式發行,這次新增了大家期待已久的 UPSERT 功能:「PostgreSQL 9.5: UPSERT, Row Level Security, and Big Data」。

SQL:2003 正式定義出 UPSERT,被稱為 Merge,不過看網路上一般還是比較習慣 UPSERT 這個用法:

A relational database management system uses SQL MERGE (also called upsert) statements to INSERT new records or UPDATE existing records depending on whether condition matches.

也就是當沒資料的時候就 INSERT,有資料的時候就 UPDATE 的語法。常見的使用情境是拿來當 counter 用 (雖然這很傷資料庫的效能)。

沒有 UPSERT 的時候只能用 transaction 或是 store procedure 搭出來,效能上會比在 database engine 裡實作來的差,所以 UPSERT 還是被實作出來了。

繼續學 Isolation...

在「測試 MariaDB 上 Galera Cluster 的 Isolation」這邊與「從頭學一次 Isolation level 的 REPEATABLE-READ 與 SERIALIZABLE」對應了 Stripe 的「Call me Maybe: MariaDB Galera Cluster」以及 Percona 的「Clarification on “Call me Maybe: MariaDB Galera Cluster”」這兩篇文章,兩邊在空中交火...

而「Call me Maybe: Percona XtraDB Cluster」這篇是目前最新的一篇 (文章日期有點怪怪的,不應該是 2015/09/04)。

畢竟 Galera Cluster 是目前現在在用的產品,雖然跑起來也沒什麼大問題,但預先多了解一點總是比較保險。而要參與這場 Isolation 戰爭,又花了不少時間看幾份資料。(包括重頭讀的)

第一個當然是 1992 年發表的 SQL-92:「ISO/IEC 9075:1992, Database Language SQL」,在 SQL-92 定義了四個 Isolation 的層級,分別是 SERIALIZABLEREPEATABLE READREAD COMMITTEDREAD UNCOMMITTED

另外一篇常被拿出來討論的是 1995 年由微軟帶頭發表的「A Critique of ANSI SQL Isolation Levels」,主要是提出 Snapshot Isolation 的定義。

後來讀到覺得真的很怪的時候到 Stack Exchange 上的 Database Administrators 問:「Does SQL92's REPEATABLE-READ preclude Write Skew (A5B)?」,確認了我的想法。

這篇算是讀書心得吧,留給自己用的 :p

首先是 REPEATABLE READ 的定義是透過「現象」定義,只要符合這些條件就可以被稱作 REPEATABLE READ,這包括了保證不會出現 P1 ("Dirty read"):

1) P1 ("Dirty read"): SQL-transaction T1 modifies a row. SQL-transaction T2 then reads that row before T1 performs a COMMIT. If T1 then performs a ROLLBACK, T2 will have read a row that was never committed and that may thus be considered to have never existed.

以及保證不會出現 P2 ("Non-repeatable read"):

2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.

並且保證 Atomicity 以及 no updates will be lost:

The four isolation levels guarantee that each SQL-transaction will be executed completely or not at all, and that no updates will be lost.

只要做到這些條件,就可以說符合 SQL-92 的 REPEATABLE READ 要求。


另外一方面,在「A Critique of ANSI SQL Isolation Levels」定義的 SQL-92 認為 REPEATABLE READ 是透過 locking 機制實作的 (整篇的寫法的確是朝這個方向走),並且基於 locking mechanism 來推論後續的行為,所以這篇 paper 裡面很多說明是帶著這個大前提的。

而當時已經有資料庫是使用 MVCC 架構在實作,可以透過 read snapshot 的方式找出另外一條路。

舉例來說,InnoDB 對 REPEATABLE READ 實作的方式是透過 snapshot 實作:(出自「MySQL :: MySQL 5.6 Reference Manual :: 13.3.6 SET TRANSACTION Syntax」)

For consistent reads, there is an important difference from the READ COMMITTED isolation level: All consistent reads within the same transaction read the snapshot established by the first read. This convention means that if you issue several plain (nonlocking) SELECT statements within the same transaction, these SELECT statements are consistent also with respect to each other.

這樣避開了 P1、P2 (甚至多做避開到了 P3,不過這邊不重要),並且 Atomicity。而 Update Lost 問題則再透過 lock 機制避免。所以 InnoDB 的 REPEATABLE READ 其實比 SQL-92 的要求多做到了很多事情...


到這邊再回頭看原來的討論文章,才會發現 Isolation 裡有滿滿的地雷可以採 XD

從頭學一次 Isolation level 的 REPEATABLE-READ 與 SERIALIZABLE

在「測試 MariaDB 上 Galera Cluster 的 Isolation」這篇提到了 StripeKyle Kingsbury 寫了「Call me Maybe: MariaDB Galera Cluster」這篇文章,在討論 Galera Cluster 的 Isolation 問題。

Percona 的 CTO Vadim Tkachenko 寫了回應的文章:「Clarification on “Call me Maybe: MariaDB Galera Cluster”」。

看完後再跑去翻了一些資料,首先是 SQL 92 對 Isolation 的原始定義,尤其是對 REPEATABLE-READ 的定義。

REPEATABLE-READ 是透過 P2 ("Non-repeatable read") 來定義的,只要保證不會發生 P2 就是 REPEATABLE-READ 了:

P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.

也就是說,T1 讀過的 row 只要在 T1 transaction 的期間內都保持一樣就可以。由於沒有多餘其他保證,所以有機會產生 P3 ("Phantom") 問題:

P3 ("Phantom"): SQL-transaction T1 reads the set of rows N that satisfy some . SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the used by SQL-transaction T1. If SQL-transaction T1 then repeats the initial read with the same <search condition>, it obtains a different collection of rows.

SERIALIZABLEREPEATABLE-READ 的定義方式不同,並不是避免 P3 就可以叫做 SERIALIZABLE。SQL 92 定義說明必須與序列執行的結果一樣才能叫做 SERIALIZABLE

The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.

所以依照定義,SERIALIZABLE 一定可以避免 P3 發生,但避免了 P3 不代表做到 SERIALIZABLE

另外一份資料是「MySQL :: MySQL 5.6 Reference Manual :: 13.3.6 SET TRANSACTION Syntax」對於 MySQL 5.6 中 InnoDB 實作 REPEATABLE-READ 的細節:

This is the default isolation level for InnoDB. For consistent reads, there is an important difference from the READ COMMITTED isolation level: All consistent reads within the same transaction read the snapshot established by the first read. This convention means that if you issue several plain (nonlocking) SELECT statements within the same transaction, these SELECT statements are consistent also with respect to each other. See Section 14.2.2.2, “Consistent Nonlocking Reads”.

InnoDB 的 REPEATABLE-READ 會在第一次讀取時建立一份 snapshot,所以 MySQL 的 REPEATABLE-READ 有達到不產生 P2 的要求,但也因為 snapshot 的關係而不會產生 P3,不過也就僅此而已,寫入的部份還是沒有達到 SERIALIZABLE 的要求。

回到原來的文章所整理出來的測試資料 (我稍微排版過),在這兩個不同的 transaction 同時發生時:(不同人都轉帳給 id = 8)

BEGIN; -- T1
SELECT balance FROM accounts WHERE account_id = 5;
SELECT balance FROM accounts WHERE account_id = 8;
UPDATE account SET balance = 75 WHERE account_id = 5;
UPDATE account SET balance = 125 WHERE account_id = 8;
COMMIT;
BEGIN; -- T2
SELECT balance FROM accounts WHERE account_id = 6;
SELECT balance FROM accounts WHERE account_id = 8;
UPDATE account SET balance = 80 WHERE account_id = 6;
UPDATE account SET balance = 120 WHERE account_id = 8;
COMMIT;

SERIALIZABLE 保證執行結果會與某種循序的結果相符 (可能是 T1 -> T2,也可能是 T2 -> T1)。但 REPEATABLE-READ 只保證在 transaction 裡面看到的 row 不會變化。

在 MySQL 裡,如果你想要跟 MySQL 告知「變更的結果會收到某些欄位的值的影響」,你至少要用 LOCK IN SHARE MODE 來做 (或是用 FOR UPDATE 會更清楚表示出意思),也就是:

BEGIN; -- T1
SELECT balance FROM accounts WHERE account_id = 5 LOCK IN SHARE MODE;
SELECT balance FROM accounts WHERE account_id = 8 LOCK IN SHARE MODE;
UPDATE account SET balance = 75 WHERE account_id = 5;
UPDATE account SET balance = 125 WHERE account_id = 8;
COMMIT;
BEGIN; -- T2
SELECT balance FROM accounts WHERE account_id = 6 LOCK IN SHARE MODE;
SELECT balance FROM accounts WHERE account_id = 8 LOCK IN SHARE MODE;
UPDATE account SET balance = 80 WHERE account_id = 6;
UPDATE account SET balance = 120 WHERE account_id = 8;
COMMIT;

這樣至少會有一個 transaction 會出現 deadlock 訊息,於是就會確保是正確的結果:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

值得一提的是,Percona 文章提到的第二種解法是有問題的:

BEGIN;
SELECT balance FROM accounts WHERE account_id = 5;
SELECT balance FROM accounts WHERE account_id = 8;
UPDATE account SET balance = balance - 25 WHERE account_id = 5;
UPDATE account SET balance = balance + 25 WHERE account_id = 8;
COMMIT;

這樣「總和」會因為保證原子性 (atomicity) 所以不會產生問題,但 balance 可能會產生出負數。

所以結論是 Kyle Kingsbury 那篇文章測試的方法本來就有問題,在 SELECT 時少了 LOCK IN SHARE MODE 或是 FOR UPDATE 的要求。

Berkeley DB 的介紹

在滿滿都是 NoSQL 的世代中,意外在「Berkeley DB: Architecture」這邊看到 Berkeley DB 的介紹...

2006 年 Berkeley DB 的公司 SleepycatOracle 收購。在收購後 Oracle 改變了 open source 授權部份,從之前的 Sleepycat License 改成了 AGPLv3

Berkeley DB 算是早期功能很完整的 database library,由於 page level locking、crash-safe 加上有 transaction,也曾經被 MySQL 拿去當作 engine,不過在 MySQL 5.1 被拔掉:「14.5 The BDB (BerkeleyDB) Storage Engine」。

文章裡講了很多底層設計上的想法 (而非單純只說明「做了什麼」),以四個面向來討論。Buffer、Lock、Log 以及 Transaction,並且圍繞著 ACID 需求討論。

算是懷念的考古文?Google 弄出來的 LevelDBFacebook 接著改善的 RocksDB 的走向也不太一樣了,現在大家對 ACID 需求因為 NoSQL 盛行的關係又重新在檢視...

Galera Cluster (Percona XtraDB Cluster) 對寫入熱點的處理方式

Percona 的「Optimizing Percona XtraDB Cluster for write hotspots」這篇說明了在 Galera Cluster 上發生多個 transaction 衝突時的處理方式。

不過解法是早就已經知道的方法,也就是把寫入都丟到同一台上處理:

So what are your options with PXC when the workload has write hotspots? The most obvious one is to write on a single node: then you will have the same locking model as with a standalone InnoDB server. Performance will not be as good as with a standalone server as you will have to pay for synchronous replication, but you will avoid the very expensive write conflicts.

講解的文章,對之後 troubleshooting 與 performance tuning 有幫助...