Category Archives: Database

MySQL with 100k Connections on a Single Server

Also spotted in "Links: February 2019", which points to Percona's "MySQL Challenge: 100k Connections".

Percona's test keeps every connection busy rather than idle, so it feels a bit like simulating the moment when everything piles up? Only a few parameters stand out:

table_open_cache = 200000
back_log=3500
max_connections=110000
max_prepared_stmt_count=1000000

Raising max_connections is the obvious part. max_prepared_stmt_count is bumped as well because every connection is actually doing work, and back_log lets the kernel hold on to TCP connections it has not had time to accept yet.

With sysbench the server seems to hold up, roughly in line with theory: latency rises as the number of connections grows...
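
As a minimal sanity-check sketch (not from the article), these are the kinds of statements you could run against the tuned server to confirm the settings took effect and to watch the connection count during the benchmark; note that back_log is not a dynamic variable, so it has to be in my.cnf before startup:

-- confirm the tuned variables on the running server
SHOW GLOBAL VARIABLES
  WHERE Variable_name IN ('max_connections', 'back_log',
                          'table_open_cache', 'max_prepared_stmt_count');

-- how many client connections are currently open
SHOW GLOBAL STATUS LIKE 'Threads_connected';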

The PostgreSQL Fix for fsync()

I wrote about this earlier in "PostgreSQL's fsync() Headache...", where fsync() behaves differently from what the developers expected, but then I forgot to follow the progress...

I just saw Percona's "PostgreSQL fsync Failure Fixed – Minor Versions Released Feb 14, 2019" and realized the corresponding updates had already shipped on 2/14; the release notes say:

By default, panic instead of retrying after fsync() failure, to avoid possible data corruption (Craig Ringer, Thomas Munro)

Some popular operating systems discard kernel data buffers when unable to write them out, reporting this as fsync() failure. If we reissue the fsync() request it will succeed, but in fact the data has been lost, so continuing risks database corruption. By raising a panic condition instead, we can replay from WAL, which may contain the only remaining copy of the data in such a situation. While this is surely ugly and inefficient, there are few alternatives, and fortunately the case happens very rarely.

A new server parameter data_sync_retry has been added to control this; if you are certain that your kernel does not discard dirty data buffers in such scenarios, you can set data_sync_retry to on to restore the old behavior.

So the current workaround is: when fsync() fails, PostgreSQL panics outright to avoid data corruption and replays from the WAL, which also means any HA mechanism (if you designed one) may end up being triggered by this...

They also added data_sync_retry, which lets a PostgreSQL administrator forcibly turn the panic behavior off and have PostgreSQL retry fsync() instead; this should become useful once the kernels are fixed later on...
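
A minimal sketch of flipping that knob, assuming you have verified your kernel really does keep dirty buffers around after a write-back failure (data_sync_retry defaults to off and can only be set at server start, so a restart is required):

-- write the override into postgresql.auto.conf
ALTER SYSTEM SET data_sync_retry = on;

-- after restarting PostgreSQL, confirm the old retry behavior is back
SHOW data_sync_retry;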

Tools for Migrating from Microsoft SQL Server to PostgreSQL

In "How to Migrate from Microsoft SQL Server to PostgreSQL", the author's customer needed to move from Microsoft SQL Server to PostgreSQL (no reason is given).

The migration is done in two stages. The first stage is schema conversion, for which the author uses dalibo/sqlserver2pgsql, a tool written in Perl:

Migration tool to convert a Microsoft SQL Server Database into a PostgreSQL database, as automatically as possible http://dalibo.github.io/sqlserver2pgsql
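
To give a feel for what this stage has to translate, here is a purely hypothetical table (not taken from the article) showing the kind of type and identity-column differences such a conversion has to deal with:

-- SQL Server side (illustrative only)
CREATE TABLE customers (
    id         INT IDENTITY(1,1) PRIMARY KEY,
    name       NVARCHAR(200) NOT NULL,
    created_at DATETIME2 DEFAULT SYSUTCDATETIME()
);

-- a rough PostgreSQL equivalent
CREATE TABLE customers (
    id         SERIAL PRIMARY KEY,
    name       VARCHAR(200) NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);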

The second stage is data conversion, for which they picked the Community Edition of Pentaho Data Integration:

Pentaho offers various stable data-centric products. Pentaho Data Integration (PDI) is an ETL tool which provides great support for migrating data between different databases without manual intervention. The community edition of PDI is good enough to perform our task here. It needs to establish a connection to both the source and destination databases. Then it will do the rest of work on migrating data from SQL server to Postgres database by executing a PDI job.

So the answer is two tools chained together... The article says nothing about stored procedures and the like, so presumably the customer uses few of them, if any?

PostgreSQL's fsync() Headache...

A talk from FOSDEM 2019 on how PostgreSQL, while guaranteeing the Durability part of ACID, ran into fsync() behaving differently from what it assumed (mainly the behavior when fsync() fails): "PostgreSQL vs. fsync".

The Q&A-style interview "PostgreSQL vs. fsync. How is it possible that PostgreSQL used fsync incorrectly for 20 years, and what we'll do about it." gives a quick rundown of the short-term plan and the long-term thinking:

The short-term solution is ensuring that we detect fsync errors reliably at least on sufficiently recent kernels (since 4.13). On older kernels we can’t do much better, unfortunately.

The long-term solution is still being discussed in the community, but it’s hard to say how we could keep relying on buffered I/O in the future. So we may end up with direct I/O, but that’s a pretty significant change and is likely going to be a multi-year project.

MySQL, on the other hand, lives mostly in an O_DIRECT world, so it is affected far less...
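
For reference, the setting that comparison hinges on is InnoDB's flush method; here is a quick way to check it on a running server (it is read-only at runtime, so changing it means editing my.cnf and restarting). This is just a generic illustration, not something from the talk:

SHOW GLOBAL VARIABLES LIKE 'innodb_flush_method';

-- a common my.cnf setting on Linux:
-- [mysqld]
-- innodb_flush_method = O_DIRECT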

How Braintree (PayPal) Uses PostgreSQL

The hardest problems with an RDBMS all revolve around "how do we avoid interrupting the service" (many things are easy once you stop caring about uptime/downtime, from ALTER to failover to backup-and-restore plans). Braintree, part of PayPal, discusses how to change a PostgreSQL database schema without interrupting the service in "PostgreSQL at Scale: Database Schema Changes Without Downtime".

Most of the article is detail aimed at DBAs (e.g. how to avoid triggering large-scale locks that take the service down) rather than developer-facing material... but the opening part, which I think is the most important, is the part that needs developers on board:

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

To meet these two requirements, a schema change may have to be carried out in several stages instead of in one shot. It is also what keeps you out of the "close the site and pour the data back from backup" scenario...
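
As a minimal sketch of what "several stages" can look like, here is how adding a required column to a large PostgreSQL table might be staged; the table and column names are made up, and the exact sequence is my own illustration rather than something lifted from the Braintree article:

-- stage 1: add the column as nullable; this only needs a very brief lock
ALTER TABLE payments ADD COLUMN merchant_ref text;

-- stage 2: deploy code that writes merchant_ref on new rows
--          but still tolerates NULL when reading old rows

-- stage 3: backfill existing rows in small batches, not one giant transaction
UPDATE payments
   SET merchant_ref = 'unknown'
 WHERE id BETWEEN 1 AND 10000
   AND merchant_ref IS NULL;
-- ...repeat for the remaining id ranges...

-- stage 4: only after the backfill, enforce the constraint
-- (this still scans the table; the article covers how to do even this safely)
ALTER TABLE payments ALTER COLUMN merchant_ref SET NOT NULL;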

It is worth studying how to play these patterns: once you know how to design the steps for the common scenarios, you will be far more fluent when you actually run into them.

The MySQL Architecture at JPMorgan Chase's WePay

"Highly Available MySQL Clusters at WePay" describes WePay's MySQL design. I assumed at first it was a WeChat service, but after looking it up more carefully it turns out to be a JPMorgan Chase service...

The architecture runs on GCP. The original MySQL setup used MHA + HAProxy (a patched version that allows the pool to be changed dynamically), with Routes handling HAProxy failover.

The problems they ran into: a crash failover needed at least 30 minutes to switch over, plus the network partition issues that come with spanning zones on GCP...

The follow-up architecture got even more complicated, which makes you wonder whether it really solved the problem XDDD

They switched to the Orchestrator setup released by GitHub, routed traffic through two layers of HAProxy (one on the client side, the other being the load balancer from the original architecture), plus Consul to push updates into HAProxy?

Thinking through why it is designed this way (given the financial-industry background) is actually rather interesting...

All the Gotchas in DynamoDB Autoscaling...

AdRoll's field notes on the landmines in DynamoDB Autoscaling; some of the details you would never notice unless you actually dove in (the devil really is in the details): "Managing DynamoDB Autoscaling with Lambda and Cloudwatch".

The first problem mentioned is what autoscaling actually watches:

Ideally, the table should scale based on the number of requests that we are making, not the number of requests that are successful.

Another one is that autoscaling will not scale down when the table is not used at all, which looks like some kind of safety mechanism. But it means that tables normally used only for reads still need manual handling to scale write capacity back down after a batch job finishes:

Additionally, at the time of implementing this algorithm, the DynamoDB capacity could not be brought down automatically if the consumption was exactly zero, which can happen if you write to your table in batch instead of realtime, for example.

This meant that, when enabling autoscaling, tables that were read in realtime, but written to in batch, still needed manual intervention to bring the write capacity down after our jobs were done writing.

Yet another problem is that the number of scale-downs is limited:

Another interesting point that might bite users is that capacity decreases are an expensive operation for AWS, so they’re limited.

The number of decreases cited in the documentation can be achieved under very special conditions, since you need to have 4 decreases in the first hour of the day plus one for each of the remaining hours, for a total of 4 (first hour) + 23 (1 hourly) = 27.

The rest is them working out which algorithm allows finer-grained adjustments and rewriting it with Lambda... which in the end brought the cost down to roughly 30% of what it was:

Here is where we detected our costs for our batch tables dropping to around 30% of the initial cost.

AdRoll's scale is presumably not small, so cutting those tables down to around 30% of the original cost is worth spending quite a bit of effort on...

TiDB Performance on a Single Server

TiDB is a database built for distributed computation that aims to fully emulate the MySQL protocol. Percona tried benchmarking TiDB on a single server; the tests are simple, but the results are quite interesting: "A Quick Look into TiDB Performance on a Single Server".

What Percona observed is that TiDB can use multiple CPU cores for a single SQL query (MySQL only ever uses one), so on higher-end machines certain SQL queries are much faster. OLAP-style queries do well too, but the common OLTP workloads are quite a bit slower:

Short version: TiDB supports parallel query execution for selects and can utilize many more CPU cores – MySQL is limited to a single CPU core for a single select query. For the higher-end hardware – ec2 instances in my case – TiDB can be 3-4 times faster for complex select queries (OLAP workload) which do not use, or benefit from, indexes. At the same time point selects and writes, especially inserts, can be 5x-10x slower. Again, please note that this test was on a single server, with a single TiKV process.

An interesting drop-in...

Mark Callaghan Introduces LSM Trees in Five Minutes

Mark Callaghan, who implemented MyRocks, spent five minutes at CIDR 2019 introducing LSM trees: "Geek code for LSM trees". Poking around, I found that CIDR is held every two years, unlike the conferences I had come across before...

The slides are up as "Diversity of LSM tree shapes", with the key points laid out around the constraint of finishing within five minutes...