原來有專有名詞:TOCTOU (Time-of-check to time-of-use)

看「The trouble with symbolic links」這篇的時候看到的專有名詞:「TOCTOU (Time-of-check to time-of-use)」,直翻是「先檢查再使用」,算是一個常見的 security (hole) pattern,因為檢查完後有可能被其他人改變,接著使用的時候就有可能產生安全漏洞。

在資料庫這類環境下,有 isolation (ACID 裡的 I) 可以確保不會發生這類問題 (需要 REPEATABLE-READ 或是更高的 isolation level)。

但在檔案系統裡面看起來不太順利,2004 年的時候研究出來沒有 portable 的方式可以確保避免 TOCTOU 的問題發生:

In the context of file system TOCTOU race conditions, the fundamental challenge is ensuring that the file system cannot be changed between two system calls. In 2004, an impossibility result was published, showing that there was no portable, deterministic technique for avoiding TOCTOU race conditions.

其中一種 mitigation 是針對 fd 監控:

Since this impossibility result, libraries for tracking file descriptors and ensuring correctness have been proposed by researchers.

然後另外一種方式 (比較治本) 是檔案系統的 API 支援 transaction,但看起來不被主流接受?

An alternative solution proposed in the research community is for UNIX systems to adopt transactions in the file system or the OS kernel. Transactions provide a concurrency control abstraction for the OS, and can be used to prevent TOCTOU races. While no production UNIX kernel has yet adopted transactions, proof-of-concept research prototypes have been developed for Linux, including the Valor file system and the TxOS kernel. Microsoft Windows has added transactions to its NTFS file system, but Microsoft discourages their use, and has indicated that they may be removed in a future version of Windows.

目前看起來的問題是沒有一個讓 Linux community 能接受的 API 設計?

Git 2.37.0 對巨大 Monorepo 的加速功能 FSMonitor

這邊用 GitHub 寫的說明好了:「Improve Git monorepo performance with a file system monitor」。

從 2.37.0 開始,Windows 與 Mac 版的使用者可以透過 FSMonitor 的功能記錄檔案系統的變化,大幅減少需要 scan 整個 repository 的時間,可以看到啟用後對於像是 chromium 這種大型專案的 status 時間就大幅下降了:

不過 Linux 還沒支援,目前我的環境都是 Linux,就沒辦法用了...

Amazon EFS 效能提昇的一些討論

上一篇「Amazon EFS 的效能提昇」提到 Amazon EFS 的效能提昇,在 Hacker News 上看到 Amazon EFS 團隊的 PMT (Product-Manager-Technical) 出來回一些東西:「Amazon Elastic File System Update – Sub-Millisecond Read Latency (amazon.com)」,搜尋 geertj 應該就可以看到他回的東西了...

像是即使是 Jeff Barr 發表這篇文章,也還是經過 legal team 的同意才能發表:

(PMT on the EFS team).

Yes, the wordings are carefully formulated as they have to be signed off by the AWS legal team for obvious reasons. With that said, this update was driven by profiling real applications and addressing the most common operations, so the benefits are real. For example, a simple WordPress "hello world" is now about 2x as fast as before.

另外這次的效能提昇是透過 cache 層達成的:

I'm the PMT for this project in the EFS team. The "flip the switch" part was indeed one of the harder parts to get right. Happy to share some limited details. The performance improvement builds on a distributed consistent cache. You can enable such a cache in multiple steps. First you deploy the software across the entire stack that supports the caching protocol but it's disabled by configuration. Then you turn it for the multiple components that are involved in the right order. Another thing that was hard to get right was to ensure that there are no performance regressions due to the consistency protocol.

然後在每個 AZ 都有 cache:

The caches are local to each AZ so you get the low latency in each AZ, the other details are different. Unfortunately I can't share additional details at this moment, but we are looking to do a technical update on EFS at some point soon, maybe at a similar venue!

另外看起來主要就是 metadata cache 的幫助:

NFS workloads are typically metadata heavy and highly correlated in time, so you can achieve very high hit rates. I can't share any specific numbers unfortunately.

還是有很多細節數字不能透漏,但知道是透過 cache 達成的就已經可以大致上想像後面是怎麼弄出來的了...

Amazon EFS 的效能提昇

AWS 宣佈他們將 Amazon EFS 的 latency 大幅降低以提昇效能:「Amazon Elastic File System Update – Sub-Millisecond Read Latency」。

Linux 上一般是用 NFS 掛 EFS,個位數的 ms 的確對於效能影響超大,現在宣稱讀取的部份降到 0.6ms,應該會有蠻明顯的感覺:

Up until today, EFS latency for read operations (both data and metadata) was typically in the low single-digit milliseconds. Effective today, new and existing EFS file systems now provide average latency as low as 600 microseconds for the majority of read operations on data and metadata.


This performance boost applies to One Zone and Standard General Purpose EFS file systems. New or old, you will still get the same availability, durability, scalability, and strong read-after-write consistency that you have come to expect from EFS, at no additional cost and with no configuration changes.

另外就是過去幾個禮拜他們把現有的 EFS 都轉移過去了:

We “flipped the switch” and enabled this performance boost for all existing EFS General Purpose mode file systems over the course of the last few weeks, so you may already have noticed the improvement. Of course, any new file systems that you create will also benefit.

不過 EFS 另外一個問題就是貴炸,用錢換方便...

Amazon EFS 提供 Replication 功能

Jeff Barr 在官方 blog 上宣佈 Amazon EFS 提供 replication 功能:「New – Replication for Amazon Elastic File System (EFS)」。


在建起來以後會是 read-only filesystem:

另外有提供 fail-over 機制,當 fail-over 過去後會從 read-only 變成 read-write。

不過要注意,架構上屬於 eventually consistent,預期是一分鐘內會更新。這點算是可以預期的,不然 latency 會太高:

All replication traffic stays on the AWS global backbone, and most changes are replicated within a minute, with an overall Recovery Point Objective (RPO) of 15 minutes for most file systems.

然後 replication 不會計算到 I/O 的 credit 與 throughput,算是比較特別的一點:

Replication does not consume any burst credits and it does not count against the provisioned throughput of the file system.

replication 這個服務本身不另外收費,只收取 EFS 使用的空間以及 replication 產生的頻寬費用:

You pay the usual storage fees for the original and replica file systems and any applicable cross-region or intra-region data transfer charges.

在 ZFS 上跑 PostgreSQL 的調校

在「Everything I've seen on optimizing Postgres on ZFS」這邊看到如果要在 ZFS 上面跑 PostgreSQL 時的調校方式,看起來作者有一直在更新這篇,所以需要的時候可以跑去看...

主要的族群是要搞 self-hosted PostgreSQL 的人,相較於 ext4 或是 XFS,底層如果使用 ZFS 可以做許多事情,像是 compression 與 snapshot,這對於很多 DBA 相關的操作會方便不少,但也因為 ZFS 的關係,兩邊 (& PostgreSQL) 需要一起調整以確保效能...

不過短期應該還是用 RDS 就是了...

MySQL 跑在 ZFS 與 ext4 的效能差異

Percona 的「MySQL/ZFS Performance Update」這篇又對 ZFS 做了一次測試,算是用比較新的軟體跑出來的結果,不過要注意這邊的 ZFS 版本仍然不是目前最新版:

ZFS 0.8.6-1 is not bleeding edge, there have been more than 1700 commits since and after 0.8.6, the ZFS release number jumped to 2.0. The big addition included in the 2.0 release is native encryption.

機器是在雲端上 (Azure 上),不熟悉 Azure 的機種,但看記憶體與 CPU 的量好像不是用頂規的機器:

benchmark host
Standard D2ds_v4 instance
2 vCpu, 8GB of Ram and 75 GB of temporary storage
Debian Buster

Database host
Standard E4-2ds-v4 instance
2 vCpu, 32GB of Ram and 150GB of temporary storage
256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
Debian Buster
Percona server 8.0.22-13


看了一下測試用的設定,似乎只測了 compression 的部份,沒測 snapshot 以及其他功能會對效能有什麼影響,但至少基本盤應該是還不錯?

Amazon EFS 推出 One Zone 版本

Amazon EFS 提供 One Zone 的版本,用較低的可靠度提供更低的價錢:「New – Lower Cost Storage Classes for Amazon Elastic File System」。

價錢大約是 53 折,不過要注意不在同一個 AZ 時使用會有頻寬費用:

Standard data transfer fees apply for inter-AZ or inter-region access to file systems.

目前想的到的是 /net/tmp 這類的用途,資料掉了也就算了,考慮到可靠度,其他的用途好像暫時想不到...

用 flock 防止指令同時間重複執行

flock(1) 是個好用的工具,可以避免指令同時重複執行。基本的用法是:

flock /tmp/example.lock /usr/bin/example

從這邊的 /tmp/example.lock 可以看出是透過檔案系統層的鎖實做的,也因為這樣的設計,所以用在 NFS 上時,可以規劃出跨機器的鎖定機制。(不過要注意舊版的 NFSv2/NFSv3 是可以刻意不跑 lock daemon 的,這樣就爛掉了 XD)

配合 crontab 執行時,-n 也是常用的選項,用在「這次沒跑也沒關係」的東西上,通常用在是計算出來後丟到 cache 內的程式,跑比較久的時候就跳過這次...

AWS 宣佈提昇 Amazon EFS 的最低效率

AWS 宣佈提昇 Amazon EFS 的最低效率:「Amazon Elastic File System increases file system minimum throughput」。


Amazon Elastic File System (Amazon EFS) file systems using the default bursting throughput mode now have a minimum throughput of 1 MiB/s. All EFS bursting mode file systems (regardless of size) can drive 100 MiB/s of throughput, and file systems with more than 1TiB of Standard class storage can drive 100 MiB/s per TB when burst credits are available. This change increases the minimum throughput from 50KiB/s per GiB of Standard class storage to a fixed minimum of 1 MiB/s for file systems with less than 20 GiB of Standard class storage, when burst credits are exhausted.

本來最低保證效率是每 GB 提供 50KB/sec,也就是要使用到 20GB 才會提供 1MB/sec,現在對於不到 20GB 的使用者,直接拉高到固定 1MB/sec。

這對於剛開始用的使用者會方便一些,不過 EFS 主要還是方便在不同機器上共享,效率上還是本機掛 EBS 好很多 (因為 OS 可以 cache)。

先前在 AWS 上把 /home 丟到 EFS 上面,結果因為 i/o 都需要透過網路的關係,編 pyenv 超慢,後來找一天把東西都丟回 EBS 上,速度快多了...