用 GPT-3 直接產生對應的 SQL query

Hacker News Daily 上看到的東西,直接給 GPT-3 條件,叫 GPT-3 給出對應的 SQL query,這就有點誇張了...:「Automating My Job with GPT-3」。

這樣丟進去:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct.

Input: how many users signed up in the past month?

然後這樣出來:

GPT-3 Response: SELECT COUNT(*) FROM users
WHERE signup_time > now() - interval '1 month'

把欄位名稱替換掉就可以用了... 而且接下來就更誇張了,直接要求 GPT-3 也要照著設定的 table schema 給答案:

Instruction: Given an input question, respond with syntactically correct PostgreSQL. Be creative but the SQL must be correct. Only use tables called "users" and "charges". The "users" table has columns: id (integer), signup_dt (timestamp), email (character varying), and plan_type (character varying). The "charges" table has columns: amount (bigint), user_id (integer), and charge_dt (timestamp).

Input: how much revenue did we have in the past 7 days?

然後輸出了:

GPT-3 Response: SELECT SUM(amount) FROM charges WHERE charge_dt > now() - interval '7 days'

接下來是在同樣 instruction 下,跨表格的問題:

Input: how much revenue have we had from users that signed up in the last 6 months?

這時候 INNER JOIN 就跑出來了:

.8 Temperature GPT-3 Response: SELECT SUM(charges.amount) FROM users INNER JOIN charges ON users.id = charges.user_id WHERE signup_dt >= DATE_SUB(now(), INTERVAL '6 months')

後面的問題也很精彩,看起來之後可以接上 BI dashboard,直接丟句子進去,然後拉各種資料出來視覺化?

Eventbrite 的 MySQL 升級計畫

在 2021 年看到 EventbiteMySQL 升級計畫:「MySQL High Availability at Eventbrite」。

看起來是 2019 年年初的時候 MySQL 5.1 出問題,後續決定安排升級,在 2019 年年中把系統升級到 MySQL 5.7 (Percona Server 版本):

Our first major hurdle was to get current with our version of MySQL. In July, 2019 we completed the MySQL 5.1 to MySQL 5.7 (v5.7.19-17-log Percona Server to be precise) upgrade across all MySQL instances.

然後看起來是直接在 EC2 上跑,不過這邊提到的空間問題就不太確定了,是真的把 EBS 的空間上限用完嗎?比較常使用的 gp2gp3 上限都是 16TB,不確定是不是真的用到接近爆掉了:

Not only was support for MySQL 5.1 at End-of-Life (more than 5 years ago) but our MySQL 5.1 instances on EC2/AWS had limited storage and we were scheduled to run out of space at the end of July. Our backs were up against the wall and we had to deliver!

另外在升級到 5.7 的時候,順便把本來是 INT 的 primary key 都換成 BIGINT

As part of the cut-over to MySQL 5.7, we also took the opportunity to bake in a number of improvements. We converted all primary key columns from INT to BIGINT to prevent hitting MAX value.

然後系統因為舊版的 Django 沒辦法配合 MySQL 5.7,得升級到 Django 1.6 (要注意 Django 1 系列的最新版是 1.11,看起來光是升級到 1.6 勉強會動就升不上去了?):

In parallel with the MySQL 5.7 upgrade we also Upgraded Django to 1.6 due a behavioral change in MySQL 5.7 related to how transactions/commits were handled for SELECT statements. This behavior change was resulting in errors with older version of Python/Django running on MySQL 5.7

然後採用了 GitHub 家研發的 gh-ost 當作改變 schema 的工具:

In December 2019, the Eventbrite DBRE successfully implemented a table ALTER via gh-ost on one of our larger MySQL tables.

看起來主要的原因是有遇到 pt-online-schema-change 的限制 (在「GitHub 發展出來的 ALTER TABLE 方式」這邊有提到):

Eventbrite had traditionally used pt-online-schema-change (pt-osc) to ALTER MySQL tables in production. pt-osc uses MySQL triggers to move data from the original to the “duplicate” table which is a very expensive operation and can cause replication lag. Matter of fact, it had directly resulted in several outages in H1 of 2019 due to replication lag or breakage.

另外一個引入的技術是 Orchestrator,看起來是先跟 HAProxy 搭配,不過他們打算要再換到 ProxySQL

Next on the list was implementing improvements to MySQL high availability and automatic failover using Orchestrator. In February of 2020 we implemented a new HAProxy layer in front of all DB clusters and we released Orchestrator to production!

Orchestrator can successfully detect the primary failure and promote a new primary. The goal was to implement Orchestrator with HAProxy first and then eventually move to Orchestrator with ProxySQL.

然後最後題到了 Square 研發的 Shift,把 gh-ost 包裝起來變成有個 web UI 可以操作:

2021 還可以看到這類文章還蠻有趣的...

Multithreading 版本 pt-online-schema-change

看起來是個嘗試,Percona 的人試著修改 pt-online-schema-change,讓他可以在 INSERT 時 multi-threading,然後看效果:「Multithreaded ALTER TABLE with pt-online-schema-change and myloader」。

可以看出來 thread 夠多的情況下其實都變快不少 (上圖主要是看絕對數字,下圖是看相對比率):

如果沒有意外的話應該會有更多的測試,而這些測試沒問題的話,之後的官方版本裡面應該就會有這個功能。

在 Galera Cluster 上的 DDL 操作 (e.g. ALTER TABLE)

Percona 整理了一份關於 Galara Cluster 上 DDL 操作的一些技巧,這包括了 Percona XtraDB ClusterMariaDB 的版本:「How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative)?」。

在不知道這些技巧前,一般都是拿 Percona Toolkit 裡的 pt-online-schema-change 來降低影響 (可以降的非常低),所以這些技巧算是額外知識,另外在某些極端無法使用 pt-online-schema-change 的情境下也可以拿來用...

裡面的重點就是 wsrep_OSU_method 這個參數,預設的值 TOI 就是一般性的常識,所有的指令都會被傳到每一台資料庫上執行,而 RSU 則是會故意不讓 DDL 操作 (像是 ALTER TABLE) 被 replicate 到其他機器,需要由管理者自己到每台機器上執行。

利用這個設定,加上透過工具將流量導到不同後端的資料庫上,就有機會分批進行修改,而不需要透過 pt-online-schema-change 這種工具。

從 Microsoft SQL Server 轉移到 PostgreSQL 的工具

在「How to Migrate from Microsoft SQL Server to PostgreSQL」這邊看到作者的客戶需要把 Microsoft SQL Server 轉移到 PostgreSQL (但沒有提到原因)。

裡面主要是兩個階段的轉換,第一個階段是 schema 的轉換,作者提到了 dalibo/sqlserver2pgsql 這個用 Perl 寫的工具:

Migration tool to convert a Microsoft SQL Server Database into a PostgreSQL database, as automatically as possible http://dalibo.github.io/sqlserver2pgsql

第二個階段是資料的轉換,是選擇用 Pentaho Data Integration 的 Community Edition:

Pentaho offers various stable data-​centric products. Pentaho Data Integration (PDI) is an ETL tool which provides great support for migrating data between different databases without manual intervention. The community edition of PDI is good enough to perform our task here. It needs to establish a connection to both the source and destination databases. Then it will do the rest of work on migrating data from SQL server to Postgres database by executing a PDI job.

所以用兩個工具串起來... 另外在文章裡面沒提到 stored procedure 之類的問題,應該是他們的客戶沒用到或是很少用到?

Braintree (PayPal) 用 PostgreSQL 的方式

RDBMS 最困難的事情都圍繞在「怎麼不中斷服務」(很多事情在不用考慮 uptime/downtime 的前提下很好做,不論是 ALTER 或是 failover,到備份還原計畫),而 PayPalBraintree 在「PostgreSQL at Scale: Database Schema Changes Without Downtime」這邊討論修改 PostgreSQL 的 database schema 時怎麼不中斷服務。

文章內的大部份都是給 DBA 知道的細節 (e.g. 怎麼樣才不會觸發大規模的 lock 導致服務中斷),而不是開發者面向的事情... 但開頭的部份,也是我認為最重要的部份,則是需要 Developer 參與的:

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

為了符合這兩個要素,可能會在 schema 設計上有好幾個階段的操作,而非一次到位。而且也才能避免要關站從 backup 倒資料回來的情況...

建議可以研究看看要怎麼玩,常見的情境知道怎麼設計步驟後,真的遇到的時候會比較熟練。

MySQL 裡 performance_schema 對效能的影響

最近在弄 MySQLperforemance_schema,開起來後發現效能影響沒有很大,跟印象中不太一樣... 找了一下文章發現 Percona 在 2017 年年初時有針對效能測試過:「Performance Schema Benchmarks: OLTP RW」。重點在這張圖:

圖上可以看到 Default 其實對效能的影響有限,另外文章也整理出來,有哪些設定對效能影響不會太大,可以考慮平常就開著:

Using Performance Schema with the default options, Memory, Metadata Locks and Statements instrumentation doesn’t have a great impact on read-write workload performance. You might notice slowdowns with Stages instrumentation after reaching 32 actively running parallel connections. The real performance killer is Waits instrumentation. And even with it on, you will start to notice a performance drop only after 10,000 transactions per second.

Pinboard 放出使用的 Database Schema

Twitter 上看到 Pinboard 放出他們的 DB Schema,可以看出他怎麼設計一個 bookmark site 的:

檔案在 Pinboard Database Schema 這邊可以看到。

比較好奇的是沒有用 utf8mb4,這代表 4 bytes 的 UTF-8 資料存不進去。另外是 user_tags 這個表格的設計方式,當初在使用 MySQL 設計 tag 架構也有類似的作法...

然後有些 index key 是多餘的 XDDD

performance_schema 的簡易用法

Mark Callaghan 寫了篇關於 performance_schema 的用法 (很短),讓大家先把這個參數開習慣,雖是入門推廣班:「Short guide on using performance_schema for user & table stats」。

他推薦的兩個資訊是:

select * from table_io_waits_summary_by_table
select * from events_statements_summary_by_account_by_event_name

當使用 5.7+ 時,可以考慮這兩個:

SELECT * FROM sys.schema_table_statistics
SELECT * FROM sys.user_summary

簡單到不行,但卻可以幫不少忙... 很棒的入門推廣班 XDDD

Google 與 Facebook 都在建立消息驗證系統

Google 的在「Fact Check now available in Google Search and News around the world」這,Facebook 的在「Working to Stop Misinformation and False News」這。

Google 是針對搜尋與新聞的部份給出建議,透過第三方的網站確認,像是這樣:

後面的機制是透過公開的協定進行:

For publishers to be included in this feature, they must be using the Schema.org ClaimReview markup on the specific pages where they fact check public statements (documentation here), or they can use the Share the Facts widget developed by the Duke University Reporters Lab and Jigsaw.

但也是透過演算法判斷提供的單位是否夠權威:

Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion.

而 Facebook 是針對 Timeline 上的新聞判斷,但是是透過與 Facebook 合作的 partner 判斷,而且會針對判斷為假的消息降低出現的機率:

We’ve started a program to work with independent third-party fact-checking organizations. We’ll use the reports from our community, along with other signals, to send stories to these organizations. If the fact-checking organizations identify a story as false, it will get flagged as disputed and there will be a link to a corresponding article explaining why. Stories that have been disputed also appear lower in News Feed.

我不是很喜歡 Facebook 的方法,變相的在控制言論自由 (不過也不是第一天了)。