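Amazon EBS Snapshot Lock in Compliance Mode

Jeff Barr wrote "New – Amazon EBS Snapshot Lock", introducing Snapshot Lock, a new feature for Amazon EBS.

As the name suggests, it locks a snapshot so that nobody can delete it. The interesting part is that it has two modes. The first is Governance, which only guards against accidental deletion: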

This mode protects snapshots from deletions by all users. However, with the proper IAM permissions, the lock duration can be extended or shortened, the lock can be deleted, and the mode can be changed from Governance mode to Compliance mode.

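The more important one is the second mode, Compliance. Once the cooling-off period has passed, the lock can no longer be touched, even with the highest privileges (my guess is that even the root account can't change it); the only operation left is extending the lock duration: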

This mode protects snapshots from actions by the root user and all IAM users. After a cooling-off period of up to 72 hours, neither the snapshot nor the lock can be deleted until the lock duration expires, and the mode cannot be changed. With the proper IAM permissions the lock duration can be extended, but it cannot be shortened.
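
The rules in the two quotes can be written down as a small toy model. This is only a sketch of the behavior as described above, not the AWS API: the class `SnapshotLock`, its fields, and the `has_iam_permission` flag are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SnapshotLock:
    """Toy model of the two lock modes quoted above; not an AWS API."""
    mode: str                            # "governance" or "compliance"
    locked_at: datetime
    duration: timedelta
    cool_off: timedelta = timedelta(0)   # only meaningful in compliance mode

    def expires_at(self) -> datetime:
        return self.locked_at + self.duration

    def _frozen(self, now: datetime) -> bool:
        # A compliance lock becomes immutable once the cooling-off period
        # ends, and stays that way until the lock duration expires.
        return (self.mode == "compliance"
                and self.locked_at + self.cool_off <= now < self.expires_at())

    def can_unlock(self, now: datetime, has_iam_permission: bool) -> bool:
        if now >= self.expires_at():
            return True                  # the lock has already expired
        return has_iam_permission and not self._frozen(now)

    def can_set_duration(self, now: datetime, new_duration: timedelta,
                         has_iam_permission: bool) -> bool:
        if not has_iam_permission:
            return False
        if self._frozen(now):
            return new_duration > self.duration   # extend only, never shorten
        return True
```

With a 72-hour cooling-off period and a 30-day duration, `can_unlock` is true one hour after locking and false four days after, while `can_set_duration` still accepts a longer duration at that point.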

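It really is a feature meant for regulatory compliance...

AWS Launches Lambda SnapStart to Speed Up Lambda Startup

This year's AWS re:Invent has started, and plenty of new features will show up this week, so I'll pick the ones I find more interesting to write about.

AWS has launched Lambda SnapStart for Lambda to improve cold start times: "New – Accelerate Your Lambda Functions with Lambda SnapStart".

The post uses a fairly striking example, Spring Boot on Java (the sample is at "Serverless Spring Boot 2 example"), where the cold start drops from over 6 seconds to under 200 ms: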

SnapStart has reduced the cold start duration from over 6 seconds to less than 200 ms.

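The approach is to take a snapshot of memory once the initialization code has finished and store it, so that the first step of a later cold start becomes a restore instead of another initialization: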

With SnapStart, the initialization phase (represented by the Init duration that I showed you earlier) happens when I publish a new version of the function. When I invoke a function that has SnapStart enabled, Lambda restores the snapshot (represented by the Restore duration) before invoking the function handler. As a result, the total cold invoke with SnapStart is now Restore duration + Duration.

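Not every application can adopt this as-is, though, and there are a few things to watch out for. The easier ones to understand are connections (such as pre-established connections to a backend database) and temporary files (such as data computed ahead of time and written to a temp file), both of which have to be re-created.

The more unusual one is that random number generators need to be re-initialized, otherwise there's a chance of producing identical random data. This is something most developers would overlook: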

When using SnapStart, any unique content that used to be generated during the initialization must now be generated after initialization in order to maintain uniqueness.

So AWS has patched OpenSSL for use under SnapStart, and they have also confirmed that Java's java.security.SecureRandom is fine as it is:

We have updated OpenSSL’s RAND_Bytes to ensure randomness when used in conjunction with SnapStart, and we have verified that java.security.SecureRandom is already snap-resilient.

AWS also notes that you can read the system's /dev/random or /dev/urandom directly; these naturally aren't frozen by the snapshot, so they are fine too:

Amazon Linux’s /dev/random and /dev/urandom are also snap-resilient.
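
The random number problem can be reproduced without Lambda at all. The sketch below is a simulation in plain Python, not SnapStart itself: pickling a seeded `random.Random` stands in for the memory snapshot, and the helpers `restore` and `restore_and_reseed` are hypothetical names. It assumes only that `os.urandom` pulls from the operating system's entropy source.

```python
import os
import pickle
import random

# "Init phase": a PRNG is seeded once, then the process state is snapshotted.
rng = random.Random(os.urandom(16))
snapshot = pickle.dumps(rng)   # stand-in for the memory snapshot


def restore():
    """Simulate one execution environment restored from the snapshot."""
    return pickle.loads(snapshot)


def restore_and_reseed():
    """Restore, then re-seed from the OS entropy source after the restore."""
    restored = pickle.loads(snapshot)
    restored.seed(os.urandom(16))
    return restored


# Two environments restored from the same snapshot share the PRNG state,
# so their "random" values are identical.
a, b = restore(), restore()
same = a.getrandbits(64) == b.getrandbits(64)

# Re-seeding after the restore makes the two environments diverge
# (a collision is possible in principle, but only with probability 2**-64).
c, d = restore_and_reseed(), restore_and_reseed()
differ = c.getrandbits(64) != d.getrandbits(64)
```

The same reasoning applies to anything unique generated during init, such as IDs or tokens: generate it after the restore, or derive it from a source the snapshot does not capture.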

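The feature is said to come at no extra charge, so it looks pretty good for the Java crowd?

Yet Another Document on RDBMS Basics

A few days ago I saw the article "Things You Should Know About Databases" in Hacker News Daily. It covers a lot of basic RDBMS concepts, and the corresponding Hacker News discussion is at "Things to know about databases (architecturenotes.co)".

It covers the difference between B-tree and B+tree:

Wikipedia also explains this point quite clearly in text: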

A B+ tree can be viewed as a B-tree in which each node contains only keys (not key–value pairs), and to which an additional level is added at the bottom with linked leaves.
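
To make the Wikipedia description concrete, here is a minimal hand-built sketch: internal nodes hold only separator keys, and the key-value pairs live in leaves that are chained together. The classes `Leaf` and `Internal` and the tiny tree are illustrative assumptions; a real implementation would also handle inserts, splits, and deletes.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    keys: list
    values: list
    next: Optional["Leaf"] = None   # sibling link at the bottom level


@dataclass
class Internal:
    keys: list        # separator keys only, no values
    children: list    # len(children) == len(keys) + 1


def find_leaf(node, key):
    """Descend from the root to the leaf that would contain `key`."""
    while isinstance(node, Internal):
        node = node.children[bisect_right(node.keys, key)]
    return node


def range_scan(root, lo, hi):
    """Yield (key, value) for lo <= key <= hi by walking the linked leaves."""
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, v in zip(leaf.keys, leaf.values):
            if k > hi:
                return
            if k >= lo:
                yield k, v
        leaf = leaf.next


# Hand-built two-level tree: keys 1..6 spread over three linked leaves.
l3 = Leaf([5, 6], ["e", "f"])
l2 = Leaf([3, 4], ["c", "d"], l3)
l1 = Leaf([1, 2], ["a", "b"], l2)
root = Internal([3, 5], [l1, l2, l3])
```

The linked leaves are what make range queries cheap: after one descent to the starting leaf, `range_scan(root, 2, 5)` just follows `next` pointers instead of going back up through the internal nodes.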

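Then there's the "sorted" figure in the article:

The explanation here isn't entirely correct. Wikipedia's "Database index" entry describes three architectures, Non-clustered, Clustered, and Cluster, and what the figure shows is Non-clustered. In InnoDB, data is stored in primary key order (when no primary key is specified, there's a set of rules for picking a column to act as the PK, and as a last resort a hidden key is used).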
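
The difference between the two layouts can be sketched with plain lists. This is a deliberately simplified illustration with made-up rows and hypothetical function names, not how InnoDB or any other engine is implemented.

```python
from bisect import bisect_left

# Non-clustered: rows live in a heap in insertion order; the index is a
# separate sorted list of (key, position-in-heap) pointers.
heap = [(30, "carol"), (10, "alice"), (20, "bob")]
secondary_index = sorted((pk, pos) for pos, (pk, _) in enumerate(heap))


def lookup_non_clustered(pk):
    i = bisect_left(secondary_index, (pk, -1))
    if i < len(secondary_index) and secondary_index[i][0] == pk:
        return heap[secondary_index[i][1]]   # extra hop from index to heap
    return None


# Clustered: the rows themselves are kept in primary-key order, so the
# search lands directly on the row.
clustered = sorted(heap)


def lookup_clustered(pk):
    i = bisect_left(clustered, (pk, ""))
    if i < len(clustered) and clustered[i][0] == pk:
        return clustered[i]
    return None
```

Both lookups return the same row; the non-clustered one simply pays for one more indirection, and its heap is not ordered by key.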

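Next it gets to isolation, which is also covered rather superficially: it only mentions the four levels in the ANSI standard, SERIALIZABLE, REPEATABLE READ (RR), READ COMMITTED (RC), and READ UNCOMMITTED (RU), but not other commonly seen levels such as SNAPSHOT ISOLATION (SI).

Speaking of SI, while looking up material on Snapshot isolation I sorted out the confusing situation in PostgreSQL.

In PostgreSQL 9.0 and earlier, specifying SERIALIZABLE actually only gave you Snapshot isolation; it wasn't until 9.1+ that SERIALIZABLE truly delivered the strength defined by ANSI: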

Snapshot isolation is called "serializable" mode in Oracle and PostgreSQL versions prior to 9.1, which may cause confusion with the "real serializability" mode.
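
Why snapshot isolation falls short of "real" serializability is easiest to see with the classic write skew anomaly. The sketch below is a toy in-memory store, not PostgreSQL: the class `SnapshotDB`, the on-call example, and the first-committer-wins check are all assumptions chosen for illustration.

```python
class SnapshotDB:
    """Toy snapshot-isolation store with first-committer-wins; illustrative only."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {k: 0 for k in data}

    def begin(self):
        # Each transaction works on a private snapshot of the data.
        return {"snapshot": dict(self.data),
                "seen": dict(self.version),
                "writes": {}}

    def commit(self, txn):
        # Abort only on write-write conflicts, which is all plain SI checks.
        for key in txn["writes"]:
            if self.version[key] != txn["seen"][key]:
                return False
        for key, value in txn["writes"].items():
            self.data[key] = value
            self.version[key] += 1
        return True


# Invariant the application wants: at least one doctor stays on call.
db = SnapshotDB({"alice_on_call": True, "bob_on_call": True})
t1, t2 = db.begin(), db.begin()

# Each transaction reads its own snapshot, sees the other doctor on call,
# and takes itself off call.
if t1["snapshot"]["bob_on_call"]:
    t1["writes"]["alice_on_call"] = False
if t2["snapshot"]["alice_on_call"]:
    t2["writes"]["bob_on_call"] = False

ok1, ok2 = db.commit(t1), db.commit(t2)
```

Both commits succeed because the write sets don't overlap, yet nobody is left on call. In any serial order, the second transaction would have seen the first one's write and kept its doctor on call, so this outcome is not serializable.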

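Also, the isolation levels defined by ANSI are hard to actually "use" (though they're still worth learning, since they're the basics). In practice, designs are based on what each database actually guarantees for its isolation levels.

A Hack for Percona XtraDB Cluster (PXC) Nodes That Have Been Away Too Long

I saw this trick in Percona's "How To Recover Percona XtraDB Cluster 5.7 Node Without SST". It only works on 5.7, not on 8.0. My guess is that the method could also be used with other databases running Galera Cluster...

A common problem when maintaining a Percona XtraDB Cluster is that when a node has been offline for too long, it may no longer be able to catch up with IST (Incremental State Transfer), which only applies the changes that haven't been synced yet. In that case it needs SST (State Snapshot Transfer), which means fetching an entire full copy.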
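
The IST-versus-SST decision boils down to whether the donor still holds every write-set the rejoining node is missing. The function below is only a sketch of that comparison: its name and parameters are hypothetical, and it assumes positions are plain sequence numbers with the donor caching a contiguous range of recent write-sets.

```python
def choose_state_transfer(joiner_seqno: int,
                          donor_cache_first_seqno: int,
                          donor_last_seqno: int) -> str:
    """Pick IST when the donor still caches every write-set the joiner lacks."""
    if joiner_seqno >= donor_last_seqno:
        return "none"   # already caught up, nothing to transfer
    if joiner_seqno + 1 >= donor_cache_first_seqno:
        return "IST"    # the missing range is still cached on the donor
    return "SST"        # the gap fell out of the cache, full copy needed
```

For example, a joiner at position 900 can use IST against a donor caching 800 through 1000, while a joiner at position 500 would fall back to SST. The trick in the article amounts to moving the joiner's position forward by other means until it is back inside the cached range.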

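The author's method relies on the fact that what IST can cover is usually fairly small, while binlogs are usually kept for quite a long time, so binlogs can be used to help IST.

The method is to turn off Galera Cluster first, use traditional MySQL replication to sync up to a certain point, then set the IST-related position to the point that has already been synced, and finally reattach to the Galera Cluster to recover.

This method is limited to 5.7, because in 8.0 there's no way to modify Galera Cluster's wsrep position information: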

Unfortunately, a similar solution does not work with Percona XtraDB Cluster 8.0.x, due to the modified way wsrep positions are kept in the storage engine, hence the trick with updating grastate.dat does not work as expected there.

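I suspect Percona will eventually come up with a patch that lets users change it...

A Bug in PostgreSQL's SERIALIZABLE

This was Jepsen's first time testing PostgreSQL, and it found reproducible bugs right away: "PostgreSQL 12.3".

The first bug is a problem under REPEATABLE READ. However, because the SQL-92 definition isn't rigorous enough, whether it really counts as a bug is debatable, a point the author Kyle Kingsbury also raises in the article: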

Whether PostgreSQL’s repeatable-read behavior is correct therefore depends on one’s interpretation of the standard. It is surprising that a database based on snapshot isolation would reject the strict interpretation chosen by the seminal paper on SI, but on reflection, the behavior is defensible.

The other one is more clear-cut: it's a bug under SERIALIZABLE. SQL-92 defines SERIALIZABLE like this:

The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions. A serial execution is one in which each SQL-transaction executes to completion before the next SQL-transaction begins.

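In other words, for the results of a set of transactions run under SERIALIZABLE, you can find at least one ordering such that running the transactions one after another in that order produces an equivalent result.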
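
For a handful of transactions, that definition can be checked by brute force: try every serial order and see whether any of them reproduces the observed result. The sketch below compares only the final state (not what each transaction read), and the two transactions are made-up examples rather than the ones from the Jepsen report.

```python
from itertools import permutations


def serial_outcomes(initial, transactions):
    """Set of final states reachable by running the transactions one at a time."""
    outcomes = set()
    for order in permutations(transactions):
        state = dict(initial)
        for txn in order:
            txn(state)
        outcomes.add(tuple(sorted(state.items())))
    return outcomes


def is_serializable(initial, transactions, observed):
    """True if some serial order produces exactly the observed final state."""
    return tuple(sorted(observed.items())) in serial_outcomes(initial, transactions)


# Two hypothetical transactions that each copy the other's value.
def copy_y_to_x(state):
    state["x"] = state["y"]


def copy_x_to_y(state):
    state["y"] = state["x"]


initial = {"x": 1, "y": 2}
txns = [copy_y_to_x, copy_x_to_y]
```

Starting from `x=1, y=2`, the serial orders can only end in `x=2, y=2` or `x=1, y=1`. A concurrent run in which both transactions read the initial values and end up swapping them (`x=2, y=1`) matches neither, so it is not serializable.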

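Jepsen found a pair of transactions that both succeed under SERIALIZABLE (but shouldn't):

For these two transactions, whether the top one runs first or the bottom one runs first, no serial order produces an equivalent result, so the SERIALIZABLE requirement isn't met.

It also found a case involving three transactions:

Drawing out the dependencies between the transactions based on their results shows that they form a loop, which means it's impossible for all three to succeed under SERIALIZABLE.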
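
Finding such a loop is ordinary cycle detection on a directed graph. The sketch below uses a depth-first search; the graph shape and the transaction names `T1` to `T3` are placeholders, since the actual dependency edges from the report are not reproduced here.

```python
def has_cycle(graph):
    """Detect a cycle in a directed graph given as {node: [successors]}."""
    WHITE, GREY, BLACK = 0, 1, 2     # unvisited, on the current path, done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True          # back edge: nxt is already on our path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(graph))


# An edge A -> B means "A must come before B in any equivalent serial order".
cyclic = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}
acyclic = {"T1": ["T2"], "T2": ["T3"], "T3": []}
```

If the dependency graph has a cycle, no serial order can satisfy every edge, so at least one of the transactions in the cycle should have been aborted.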

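After Jepsen found these bugs, the PostgreSQL side located the part of the software causing the bug and fixed it: "Avoid update conflict out serialization anomalies.". It looks like the bug has been there since PostgreSQL introduced Serializable Snapshot Isolation (SSI), so every version from 9.1 onward is affected...

This one was taken down cleanly, a beautifully done test... Looking through Jepsen's records, it seems MySQL hasn't been tested yet. Maybe that's the next target?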

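MongoDB's Deceptive Advertising

Jepsen recently put out a new report testing the new MongoDB 4.2.6, and the tone is noticeably harsher than before. Tracing back the story, it seems to have started with this tweet on Twitter, which pointed to a page where MongoDB uses Jepsen for promotion:

Jepsen's official account then replied that it found this unbelievable:

Two weeks later, Jepsen published "Jepsen: MongoDB 4.2.6", a report by its lead Kyle Kingsbury testing the latest version, MongoDB 4.2.6.

The report calls out a lot of unethical behavior. First, earlier tests had found many problems that lose data, yet MongoDB's official promotional page "MongoDB and Jepsen" doesn't mention them at all and even claims some of the strongest data consistency and correctness in the industry (which doesn't match the data in Jepsen's reports). Jepsen therefore suggests listing these problems on that page so users aren't "misled":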

Curiously, MongoDB omitted any mention of these findings in their MongoDB and Jepsen page. Instead, that page discusses only passing results, makes no mention of read or write concern, buries the actual report in a footnote, and goes on to claim:

MongoDB offers among the strongest data consistency, correctness, and safety guarantees of any database available today.

We encourage MongoDB to report Jepsen findings in context: while MongoDB did appear to offer per-document linearizability and causal consistency with the strongest settings, it also failed to offer those properties in most configurations. We think users might want to be aware that their database could lose data by default, but MongoDB’s summary of our work omits any mention of this behavior.

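Then of course there's the retest against MongoDB 4.2.6. If you don't have time to read the whole thing, just glance at the section titles, which already make a lot of points: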

3 Results
3.1 Sometimes, Programs That Use Transactions… Are Worse
3.2 How ACID is Snapshot Isolation, Anyway
3.3 Indeterminate Errors
3.4 Duplicate Effects
3.5 Read Skew
3.6 Cyclic Information Flow
3.7 Read Your (Future) Writes

The Discussion section at the end is clearer, though.

First, it argues that snapshot isolation isn't really "full ACID":

MongoDB 4.2.6 claims to offer “full ACID transactions” via snapshot isolation. However, the use of these transactions is complicated by weak defaults, confusing APIs, and undocumented error codes. Snapshot isolation is questionably compatible with the marketing phrase “full ACID”. Even at the highest levels of read and write concern, MongoDB’s transaction mechanism exhibited various anomalies which violate snapshot isolation.

Snapshot isolation is a reasonably strong consistency model, but claiming that snapshot isolation is “full ACID” is questionable.

And even with every data-safety setting turned up to the maximum, it still can't deliver the snapshot isolation it claims:

Finally, even with the strongest levels of read and write concern for both single-document and transactional operations, we observed cases of G-single (read skew), G1c (cyclic information flow), duplicated writes, and a sort of retrocausal internal consistency anomaly: within a single transaction, reads could observe that transaction’s own writes from the future. MongoDB appears to allow transactions to both observe and not observe prior transactions, and to observe one another’s writes. A single write could be applied multiple times, suggesting an error in MongoDB’s automatic retry mechanism. All of these behaviors are incompatible with MongoDB’s claims of snapshot isolation.

The testing also found that even when the snapshot level is set, MongoDB's reads don't honor snapshot isolation:

MongoDB’s default read and write concern for single-document operations remains local, which can observe uncommitted data, and w: 1, which can lose committed writes. Even when users select safer settings in their clients at the database or collection level, transactions ignore these settings and default again to local and w: 1. The snapshot read concern does not actually guarantee snapshot isolation, and must always be used in conjunction with write concern majority. This holds even for transactions which perform no writes.

On top of that, none of the official documentation shows how to configure snapshot isolation; you only have a chance of finding it in third-party write-ups:

Nor can users rely on examples to demonstrate snapshot isolated behavior. MongoDB’s transaction documentation and tutorial blog posts show only write-only transactions, using read concern local rather than snapshot. Other examples from MongoDB don’t specify a read concern or run entirely with defaults. Learn MongoDB The Hard Way uses read concern snapshot but write concern local, despite performing writes. Tutorials from DZone, Several Nines, Percona, The Code Barbarian, and Spring.io all claim that transactions are either ACID or offer snapshot isolation, but none set either read or write concern. There are some examples of MongoDB transactions which are snapshot isolated—for instance, from BMC, +N Consulting, and Maciej Zgadzaj, but most uses of MongoDB transactions we found ran—either intentionally or inadvertently—with settings that would (in general) allow write loss and aborted reads.

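Basically, the lead got angry and dropped a bomb. Judging by his tone there's still a lot left untested, so maybe another bombshell report is coming?

Amazon's Elasticsearch Service Offers 14 Days of Free Hourly Snapshots

Amazon Elasticsearch Service now provides hourly snapshots retained for 14 days at no charge: "Amazon Elasticsearch Service increases data protection with automated hourly snapshots at no extra charge".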

Amazon Elasticsearch Service has increased its snapshot frequency from daily to hourly, providing more granular recovery points. If you need to restore your cluster, you now have numerous, recent snapshots to choose from. These automated snapshots are retained for 14 days at no extra charge.

This only applies to 5.3+, though; older versions only get daily snapshots:

  • For domains running Elasticsearch 5.3 and later, Amazon ES takes hourly automated snapshots and retains up to 336 of them for 14 days.
  • For domains running Elasticsearch 5.1 and earlier, Amazon ES takes daily automated snapshots (during the hour you specify) and retains up to 14 of them for 30 days.

In both cases, the service stores the snapshots in a preconfigured Amazon S3 bucket at no additional charge. You can use these automated snapshots to restore domains.
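
The figure of 336 for the hourly case is just hours per day times retention days. A trivial sketch of that arithmetic, with a hypothetical helper name:

```python
HOURS_PER_DAY = 24


def max_retained(snapshots_per_day: int, retention_days: int) -> int:
    """Upper bound on snapshots kept when every one is retained for the full window."""
    return snapshots_per_day * retention_days


hourly = max_retained(HOURS_PER_DAY, 14)   # 24 * 14 = 336
```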

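That makes management a bit easier...

EC2 Simplifies the Requirements for Hibernation

Because memory holds all kinds of sensitive information, EC2 Hibernation required encryption when writing to the snapshot when it first launched, and by EC2's design you had to launch from an encrypted AMI to get an encrypted snapshot: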

Hibernation requires an EC2 instance be an encrypted EBS-backed instance. This ensures protection of sensitive contents in memory (RAM) as they get copied to the EBS upon hibernation.

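That was a hassle for most people: the AMI itself may contain nothing sensitive, so it was originally built unencrypted, which meant extra steps...

Now EC2 has announced that you can launch from a regular AMI and still produce an encrypted snapshot: "Enable Hibernation on EC2 Instances when launching with an AMI without an Encrypted EBS Snapshot".

That saves quite a bit of preparation work...

Is AWS's Data Lifecycle Manager for EBS Available in Tokyo Now?

Earlier, in "Amazon EBS Snapshot Supports Lifecycle Management", I mentioned that AWS had built Data Lifecycle Manager, which lets EBS volumes take snapshots automatically and manages how many copies are retained, so it can serve as a kind of backup mechanism.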
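
Retention by count is simple enough to sketch: sort the snapshots newest first, keep the first N, and delete the rest. This is an illustration of the idea only, not how Data Lifecycle Manager is implemented; the `prune` function, the dictionary layout, and the `snap-N` IDs are all made up.

```python
from datetime import datetime, timedelta


def prune(snapshots, keep):
    """Return (kept, deleted), keeping only the `keep` newest snapshots."""
    ordered = sorted(snapshots, key=lambda s: s["created"], reverse=True)
    return ordered[:keep], ordered[keep:]


# Five daily snapshots with hypothetical IDs; keep the three most recent.
start = datetime(2018, 7, 1)
snaps = [{"id": f"snap-{i}", "created": start + timedelta(days=i)}
         for i in range(5)]
kept, deleted = prune(snaps, keep=3)
```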

At the time of the July announcement, only a few regions were available:

Availability – Data Lifecycle Manager is available in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions.

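I just noticed it's already usable in Tokyo too? I don't seem to have seen any announcement about it, though... I'll set it up and see whether it actually runs.

AWS Provides a Way to Move from Lightsail to EC2

AWS has finally built the feature for migrating from Lightsail to EC2: "Amazon Lightsail Now Provides an Upgrade Path to EC2".

This way you can start with a small site, and once it grows, the cost of switching over to AWS services is much lower. For now, though, it appears to be done through export-and-import: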

When you are ready to upgrade an instance to EC2, simply take a snapshot of your instance and follow the step-by-step process in Lightsail's console to export your snapshot to EC2. You can then use EC2 or the Upgrade to EC2 wizard in Lightsail's console to get your new EC2 instance up and running.

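That said, when I last used Lightsail I found the available CPU noticeably slower than other VPS providers... That's something to factor in when deciding whether to pick Lightsail in the first place.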