檔案壓縮順序造成壓縮率的差異

Hacker News Daily 上看到「Why are tar.xz files 15x smaller when using Python's tar library compared to macOS tar?」這篇,作者問了為什麼他用 Pythontarfile 壓出來比起用 tar 壓出來小了 15 倍,檔案都是 JSON 檔壓成 XZ 格式:

I'm compressing ~1.3 GB folders each filled with 1440 JSON files and find that there's a 15-fold difference between using the tar command on macOS or Raspbian 10 (Buster) and using Python's built-in tarfile library.

看到 1440 個檔案應該會有直覺是一分鐘一個檔案,跑一天的量...

隔天他把原因找出來了,在裝了 GNU Tar 並且加上 --sort='name' 參數後,壓出來的大小就跟 Python 的 tarfile 差不多了:

Ok, I think I found the issue: BSD tar and GNU tar without any sort options put the files in the archive in an undefined order.

After installing GNU tar on my Mac with:

brew install gnu-tar

And then tarring the same folder, but with the --sort option:

gtar --sort='name' -cJf zsh-archive-sorted.tar.xz /Users/user/Desktop/temp/tar/2021-03-11

I get a .tar.xz archive of 1.5 MB, equal to the archive created by the Python library.

底層的原因是檔名與檔案內容有正相關的相似度 (因為裡面都是 sensor 資料),依照檔名排序壓縮就等於把類似的 JSON 檔案放在一起壓,使得 xz 可以利用這點急遽拉高壓縮率:

My JSON files contain measurements from hundreds of sensors. Every minute I read out all sensors, but only a few of these sensors have a different value from minute to minute.

By sorting the files by name (which has the creation unixtime at the beginning of it), two subsequent files have very little different characters between them. Apparently this is very favourable for the compression efficiency.

遇到類似的情境可以當作 tuning 的一種,測試看看會不會變小很多...

Leave a Reply

Your email address will not be published.