Compressing a Text File into a PNG and Comparing the Ratio Against Gzip

Saw "Compressing Text into Images (shkspr.mobi)" on Hacker News, a strange method that converts text into a greyscale PNG file and then compares the compression ratio against ZIP, and it actually holds its own. The original article is at "Compressing Text into Images".

This is one of those weird weekend side projects; even the author admits it's silly XDDD

(This is, I think, a silly idea. But sometimes the silliest things lead to unexpected results.)

The author used the play Romeo and Juliet as the test data, which can be downloaded as a txt file from Project Gutenberg's website.

The basic idea is to turn each character into a pixel of a greyscale image, write it out as a lossless PNG, and finally run it through Squoosh:

The English language and its punctuation are not very complicated, so the play only contains 77 unique symbols. The ASCII value of each character spans from 0 - 127. So let's create a greyscale image which each pixel has the same greyness as the ASCII value of the character.

The result is surprisingly comparable to ordinary compression software:

That's down to 55KB! About 40% of the size of the original file. It is slightly smaller than ZIP, and about 9 bytes larger than Brotli compression.

I tried to verify the author's idea by writing a small Go program that generates the image: a PNG whose width is the byte count of the play and whose height is 1, then ran the output through pngcrush: "gslin/text-to-image-compression".
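The core of the conversion fits in a few lines of Go. This is only a minimal sketch of the same idea (one pixel per byte, width = file length, height = 1), not the exact code in the repo; the file names are placeholders:

package main

import (
	"image"
	"image/color"
	"image/png"
	"log"
	"os"
)

func main() {
	data, err := os.ReadFile("input.txt")
	if err != nil {
		log.Fatal(err)
	}

	// One pixel per byte: width = file size, height = 1, and each
	// pixel's grey level is the byte's value (its ASCII code).
	img := image.NewGray(image.Rect(0, 0, len(data), 1))
	for i, b := range data {
		img.SetGray(i, 0, color.Gray{Y: b})
	}

	out, err := os.Create("output.png")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	// PNG is lossless, so the original text is fully recoverable
	// by reading the pixel values back.
	if err := png.Encode(out, img); err != nil {
		log.Fatal(err)
	}
}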

Also, I compared against gzip (the original author compared against ZIP), with these results:

PNG: 64379
Gzip: 64563
Gzip (-9): 64331

So they really are in the same ballpark?

How zlib, gzip and zip Are Related (Mark Adler's Answer)

Saw the discussion "How are zlib, gzip and zip related? (stackoverflow.com)" on Hacker News; the original is a question and answer from Stack Overflow in 2013: "How are zlib, gzip and zip related? What do they have in common and how are they different?".

The short form explains how the three relate, and also brings in other related formats and algorithms:

.zip is an archive format using, usually, the Deflate compression method. The .gz gzip format is for single files, also using the Deflate compression method. Often gzip is used in combination with tar to make a compressed archive format, .tar.gz. The zlib library provides Deflate compression and decompression code for use by zip, gzip, png (which uses the zlib wrapper on deflate data), and many other applications.

The long form lays out the whole history, so thoroughly that someone in the comments asked for citation links to serve as references:

This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated.

The answer's author, Mark Adler, replied in the comments with quite a mic drop:

I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source.

I looked up his track record a bit, and it checks out.

The old-timer also laid out some of the context from back then (like the patent issues), which makes it much easier to understand why things developed into what they are now...

gzip's --rsyncable (zstd Has It Too)

While looking something up I noticed that gzip has a --rsyncable flag, which claims to produce rsync-friendly compressed files:

When you synchronize a compressed file between two computers, this option allows rsync to transfer only files that were changed in the archive instead of the entire archive. Normally, after a change is made to any file in the archive, the compression algorithm can generate a new version of the archive that does not match the previous version of the archive. In this case, rsync transfers the entire new version of the archive to the remote computer. With this option, rsync can transfer only the changed files as well as a small amount of metadata that is required to update the archive structure in the area that was changed.

For an explanation of this flag, see "Rsyncable gzip"; its 2005 publication date alone shows the flag has been around for a long time:

With this option, gzip will regularly “reset” his compression algorithm to what it was at the beginning of the file. So if for example there was a change at byte 23, this change will only affect the output up to maximum (for example) byte #9999. Then gzip will restart ‘at zero’, and the rest of the compressed output will be the same as what it was without the changed byte 23. This means that rsync will now be able to re-synchronise between the old and new compressed file, and can then avoid sending the portions of the file that were unmodified.

The idea behind the flag: under normal conditions, a tiny change in the input causes all of gzip's subsequent compressed output to be completely different.

With --rsyncable, gzip periodically resets its compression state, so large parts of the compressed output stay identical, and rsync can detect the matching content and avoid retransmitting it in bulk.

The downside is immediately obvious from the flip side: it sacrifices some of the compression algorithm's efficiency, and the price is a slightly larger output file.
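The mechanism can be sketched in Go: cut the input at content-defined boundaries and restart the compressor at each cut, relying on the fact that concatenated gzip members still form one valid gzip stream. This is only an illustration of the idea; the window size and trigger below are made-up values, not what gzip's --rsyncable actually uses.

package main

import (
	"bufio"
	"compress/gzip"
	"io"
	"log"
	"os"
)

func main() {
	const window = 4096
	in := bufio.NewReader(os.Stdin)
	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	var (
		buf  []byte
		ring [window]byte
		sum  int
		n    int
	)

	// Close out the current gzip member; concatenated members are
	// still a valid gzip stream, so `gunzip` handles the result.
	flush := func() {
		if len(buf) == 0 {
			return
		}
		zw := gzip.NewWriter(out)
		if _, err := zw.Write(buf); err != nil {
			log.Fatal(err)
		}
		if err := zw.Close(); err != nil {
			log.Fatal(err)
		}
		buf = buf[:0]
	}

	for {
		b, err := in.ReadByte()
		if err == io.EOF {
			break
		} else if err != nil {
			log.Fatal(err)
		}
		buf = append(buf, b)
		// Rolling sum over the last `window` bytes; when it hits a
		// chosen value, cut here. A change in the input only moves
		// nearby cut points, so most members stay byte-identical.
		sum += int(b) - int(ring[n%window])
		ring[n%window] = b
		n++
		if n >= window && sum%window == 0 {
			flush()
		}
	}
	flush()
}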

zstd has this feature too, but xz doesn't...

I tested with zstd -19 (zstd's highest compression level?) on a BBS backup: plain compression came out to 513672097 bytes, while with --rsyncable it was 513831761 bytes, an increase of only a few parts in ten thousand (about 0.03%), basically pocket change...?

Looks like a pretty useful flag, so I'm writing this post to remember it...

cURL Supports Zstandard

Saw in "curl 7.72.0 – more compression" that the new version of cURL will support Zstandard; looking around, Zstandard has a corresponding RFC, RFC 8478: "Zstandard Compression and the application/zstd Media Type".

On the server side, it looks like you can pair it with tokers/zstd-nginx-module (in an nginx environment); otherwise the application has to do the compression itself.
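On the client side, the negotiation is simple enough to do by hand as well; here's a rough Go sketch, assuming the third-party github.com/klauspost/compress/zstd package and a placeholder URL:

package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// Ask for zstd explicitly; a server without support just sends
	// the response uncompressed. The URL is a placeholder.
	req, err := http.NewRequest("GET", "https://example.com/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept-Encoding", "zstd")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Only wrap the body in a zstd decoder if the server actually
	// answered with Content-Encoding: zstd.
	var body io.Reader = resp.Body
	if resp.Header.Get("Content-Encoding") == "zstd" {
		dec, err := zstd.NewReader(resp.Body)
		if err != nil {
			log.Fatal(err)
		}
		defer dec.Close()
		body = dec
	}

	n, err := io.Copy(io.Discard, body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("decoded bytes:", n)
}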

That said, the more widely deployed algorithm is the Google-led Brotli; from what I can find, the compression ratios are roughly in the same class.

Facebook doesn't have its own browser, so pushing these things is a bit harder for them, but cURL deciding to support Zstandard is a start, giving developers one more option to use...

Adding nginx Compression for the favicon...

Saw this in "Compressed favicons are 70% smaller but 75% of them are served uncompressed": they found that about 73.5% of websites serve favicon.ico without compression:

The HTTP Archive dataset of favicons from 4 million websites crawled from desktop devices on May 2019 shows that 73,5 % of all favicons are offered without any compression with an average file size of 10,5 kiB, 21,5 % are offered with Gzip compression at an average file size of 4 kiB, and 5 % offer Brotli compression at an average file size of 3 kiB.

Mine didn't have it either... After adding the gzip-related settings, favicon.ico's transfer size dropped from 4.2KB to 1.2KB.

I'm using nginx; on Ubuntu, gzip is already enabled in nginx's default nginx.conf, so all that's needed are a few settings telling it to also handle ico files.

The way to do it is to put this in /etc/nginx/conf.d/gzip.conf:

gzip_comp_level 9;
gzip_types image/vnd.microsoft.icon image/x-icon;
gzip_vary on;

Compared to the article, I added two settings: gzip_comp_level is raised to 9 (the default is 1), and gzip_vary advertises the gzip handling in the Vary header, so caches don't serve the wrong variant.
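A quick way to verify the change is a small Go sketch like the one below (the URL is a placeholder for your own site). Setting Accept-Encoding by hand disables Go's transparent gzip handling, so the response headers arrive exactly as nginx sent them:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Request the favicon while explicitly offering gzip.
	req, err := http.NewRequest("GET", "https://example.com/favicon.ico", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept-Encoding", "gzip")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Expect "gzip" here, and a Vary header listing Accept-Encoding.
	fmt.Println("Content-Encoding:", resp.Header.Get("Content-Encoding"))
	fmt.Println("Vary:", resp.Header.Get("Vary"))
}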

Amazon API Gateway Now Supports Compression...

Amazon API Gateway now supports compression: "Amazon API Gateway Supports Content Encoding for API Responses".

You can now enable content encoding support for API Responses in Amazon API Gateway. Content encoding allows API clients to request content to be compressed before being sent back in the response to an API request. This reduces the amount of data that is sent from API Gateway to API clients and decreases the time it takes to transfer the data. You can enable content encoding in the API definition. You can also set the minimum response size that triggers compression. By default, APIs do not have content encoding support enabled.

Once enabled, responses are compressed automatically, and you can also set the response size that triggers it... According to the documentation (Content Codings Supported by API Gateway), the supported formats are the most common ones, gzip and deflate.
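As far as I can tell this maps to the REST API's minimumCompressionSize property; here's a sketch of flipping it on with the AWS SDK for Go (the API id and threshold are placeholders, and the SDK shapes are from memory, so double-check against the docs):

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/apigateway"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := apigateway.New(sess)

	// Enable content encoding by setting minimumCompressionSize:
	// responses at or above this many bytes become eligible for
	// compression. "abc123" and 1024 are placeholder values.
	_, err := svc.UpdateRestApi(&apigateway.UpdateRestApiInput{
		RestApiId: aws.String("abc123"),
		PatchOperations: []*apigateway.PatchOperation{{
			Op:    aws.String("replace"),
			Path:  aws.String("/minimumCompressionSize"),
			Value: aws.String("1024"),
		}},
	})
	if err != nil {
		log.Fatal(err)
	}
}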

This seems to be a feature request that people have been raising ever since API Gateway first appeared...

Amazon Redshift Supports Zstandard

Amazon Redshift now supports compressing data with Zstandard: "Amazon Redshift now supports the Zstandard high data compression encoding and two new aggregate functions".

Zstandard is a compression/decompression method developed by people at Facebook, benchmarked mainly against zlib (that is, gzip); the official site has quite a few comparison charts. The goal is a better compression ratio at the same compression speed.

Redshift supporting Zstandard feels like a free upgrade for existing gzip users...

CloudFront Finally Supports gzip Compression...

CloudFront has finally announced gzip support: "New – Gzip Compression Support for Amazon CloudFront".

[Screenshot: the compression option in the CloudFront console]

However, the "Serving Compressed Files" documentation mentions that CloudFront may decline to compress (they actually have cases like this...):

CloudFront is busy

In rare cases, when a CloudFront edge location is unusually busy, some files might not be compressed.

Also, it looks like there's no way to specify which Content-Types to compress; you can only use their predefined list.

CloudFlare's Tests of Brotli

I've mentioned this before: since Firefox already supports Brotli (see the earlier post on Google releasing the Brotli lossless compression algorithm), the folks at CloudFlare put together a performance comparison: "Results of experimenting with Brotli for dynamic web content".

The main story is still that Brotli spends quite a bit of resources to buy compression ratio; for static content, which can be compressed ahead of time, it wins big (squeezing out roughly another 15%, from 27.7% with zlib level 9 down to 23.3% with brotli level 10):

The current state of Brotli gives us some mixed impressions. There is no yes/no answer to the question "Is Brotli better than gzip?". It definitely looks like a big win for static content compression, but on the web where the content is dynamic we also need to consider on-the-fly compression.

It also helps quite a bit for large files and slow connections, but for on-the-fly compression it's actually slower.
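If you want to reproduce the static-content comparison on your own data, here's a quick sketch (assuming the third-party github.com/andybalholm/brotli package for the Brotli side; the file name is a placeholder):

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"log"
	"os"

	"github.com/andybalholm/brotli"
)

func main() {
	data, err := os.ReadFile("input.txt")
	if err != nil {
		log.Fatal(err)
	}

	// gzip at its maximum level, standing in for zlib 9 above.
	var gz bytes.Buffer
	gw, err := gzip.NewWriterLevel(&gz, gzip.BestCompression)
	if err != nil {
		log.Fatal(err)
	}
	gw.Write(data)
	gw.Close()

	// Brotli at quality 10, the static-content setting in the post.
	var br bytes.Buffer
	bw := brotli.NewWriterLevel(&br, 10)
	bw.Write(data)
	bw.Close()

	fmt.Printf("original: %d, gzip -9: %d, brotli 10: %d\n",
		len(data), gz.Len(), br.Len())
}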

Completely Disabling gzip over HTTPS

While reading "Perceived Web Performance – What Is Blocking the DOM?" I ran across Google PageSpeed Insights and suddenly noticed my nginx config still had gzip enabled; because of the BREACH attack, gzip shouldn't be enabled over SSL/TLS when cookies are involved:

gzip off;

At the moment there doesn't seem to be a good mitigation against BREACH...