In 2019, mysql1 answered 950,000 queries/s on average, 900,000 queries/s on replicas, and 50,000 queries/s on the primary.
在 2021 年的時候變成 1.125M qps,其中 75K qps 在 primary 上:
Today, in 2021, the same database tables are spread across several clusters. In two years, they saw continued growth, accelerating year-over-year. All hosts of these clusters combined answer 1,200,000 queries/s on average (1,125,000 queries/s on replicas, 75,000 queries/s on the primaries). At the same time, the average load on each host halved.
另外這幾年比較成熟的方案都拿出來用了,包括用 ProxySQL 降低連線數的壓力 (connection pool 的概念):
[W]e started using ProxySQL to reduce the number of connections opened against our primary MySQL instances.
ProxySQL is used for multiplexing client connections to MySQL primaries.
Vitess is a scaling layer on top of MySQL that helps with sharding needs. We use its vertical sharding feature to move sets of tables together in production without downtime.
看起來是 2019 年年初的時候 MySQL 5.1 出問題,後續決定安排升級,在 2019 年年中把系統升級到 MySQL 5.7 (Percona Server 版本):
Our first major hurdle was to get current with our version of MySQL. In July, 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances.
Not only was support for MySQL 5.1 at End-of-Life (more than 5 years ago) but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!
另外在升級到 5.7 的時候,順便把本來是 INT 的 primary key 都換成 BIGINT:
As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value.
In parallel with the MySQL 5.7 upgrade we also Upgraded Django to 1.6 due a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older version of Python/Django running on MySQL 5.7
Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table which is a very expensive operation and can cause replication lag. Matter of fact, it had directly resulted in several outages in H1 of 2019 due to replication lag or breakage.
Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!
Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.
然後最後題到了 Square 研發的 Shift,把 gh-ost 包裝起來變成有個 web UI 可以操作: