Compressing a text file into a PNG and comparing the compression ratio with Gzip

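I saw "Compressing Text into Images (shkspr.mobi)" on Hacker News. It is an odd approach: turn the text into a greyscale PNG file, then compare the compression ratio with ZIP. Surprisingly, it holds its own. The original post is at "Compressing Text into Images".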

This is one of those odd weekend side projects, and the author also thinks it is pretty silly XDDD

(This is, I think, a silly idea. But sometimes the silliest things lead to unexpected results.)

The author used the text of the play Romeo and Juliet as the data; the txt file can be downloaded from the Project Gutenberg site.

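The basic idea is to turn each character into one pixel of a greyscale image, write it out as a lossless PNG, and finally run the result through Squoosh: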

The English language and its punctuation are not very complicated, so the play only contains 77 unique symbols. The ASCII value of each character spans from 0 - 127. So let's create a greyscale image which each pixel has the same greyness as the ASCII value of the character.
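To make the mapping concrete, here is a minimal Go sketch of the idea in the quote: one byte of text becomes the grey level of one pixel, and decoding the PNG gives the text back. This is not the original author's code; the function names `textToPNG` and `pngToText`, the `width` parameter, and the zero padding of the last row are my own assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// textToPNG stores each byte of text as the grey level of one pixel,
// filling the image row by row, and returns the encoded PNG.
// width is a hypothetical parameter; unused pixels in the last row stay 0.
func textToPNG(text []byte, width int) ([]byte, error) {
	if width <= 0 {
		return nil, fmt.Errorf("width must be positive, got %d", width)
	}
	height := (len(text) + width - 1) / width
	if height == 0 {
		height = 1 // PNG does not allow a zero-height image
	}
	img := image.NewGray(image.Rect(0, 0, width, height))
	copy(img.Pix, text) // a fresh Gray image has Stride == width, so Pix is row-major

	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// pngToText reverses textToPNG: it reads the pixels back as bytes and drops
// the zero padding at the end (assumes the text has no trailing NUL bytes).
func pngToText(data []byte) ([]byte, error) {
	img, err := png.Decode(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	gray, ok := img.(*image.Gray)
	if !ok {
		return nil, fmt.Errorf("expected an 8-bit greyscale PNG, got %T", img)
	}
	return bytes.TrimRight(gray.Pix, "\x00"), nil
}

func main() {
	text := []byte("But, soft! what light through yonder window breaks?")
	data, err := textToPNG(text, 8)
	if err != nil {
		panic(err)
	}
	back, err := pngToText(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes of text -> %d bytes of PNG\n", len(text), len(data))
	fmt.Printf("%s\n", back)
}
```

For a single sentence like this the PNG comes out larger than the text, because the PNG signature and chunk headers are a fixed cost; the trick only pays off on long inputs such as the whole play.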

The result turns out to be about the same as ordinary compression tools:

That's down to 55KB! About 40% of the size of the original file. It is slightly smaller than ZIP, and about 9 bytes larger than Brotli compression.

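I tried to verify the author's idea by writing a small Go program that generates the image: a PNG whose width is the number of bytes in the text and whose height is just 1, which I then ran through pngcrush: "gslin/text-to-image-compression".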

Also, I compared against gzip (the original author compared against ZIP). The results, as file sizes in bytes:

PNG: 64379
Gzip: 64563
Gzip (-9): 64331

So they really are in about the same league?
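The comparison above can be sketched roughly like this. It is not the code in "gslin/text-to-image-compression"; it is a self-contained approximation that uses Go's own PNG encoder in place of pngcrush and Go's `compress/gzip` in place of the gzip command, so the numbers it prints will not match the ones above. The function names `pngRowSize` and `gzipSize` and the built-in sample text are assumptions for illustration.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"image"
	"image/png"
	"os"
)

// pngRowSize encodes text as a greyscale PNG that is len(text) pixels wide
// and 1 pixel high, and returns the size of the PNG in bytes.
func pngRowSize(text []byte) (int, error) {
	img := image.NewGray(image.Rect(0, 0, len(text), 1))
	copy(img.Pix, text) // one byte of text per pixel

	var buf bytes.Buffer
	enc := png.Encoder{CompressionLevel: png.BestCompression}
	if err := enc.Encode(&buf, img); err != nil {
		return 0, err // also hit for empty input, since PNG needs width >= 1
	}
	return buf.Len(), nil
}

// gzipSize compresses text at the given gzip level and returns the size in bytes.
func gzipSize(text []byte, level int) (int, error) {
	var buf bytes.Buffer
	zw, err := gzip.NewWriterLevel(&buf, level)
	if err != nil {
		return 0, err
	}
	if _, err := zw.Write(text); err != nil {
		return 0, err
	}
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// Built-in sample; pass a file path (for example the Project Gutenberg
	// txt file) as the first argument to measure real data instead.
	text := bytes.Repeat([]byte("Two households, both alike in dignity,\n"), 200)
	if len(os.Args) > 1 {
		data, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		text = data
	}

	pngSize, err := pngRowSize(text)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	gzDefault, err := gzipSize(text, gzip.DefaultCompression)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	gzBest, err := gzipSize(text, gzip.BestCompression)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	fmt.Println("Original:", len(text))
	fmt.Println("PNG:", pngSize)
	fmt.Println("Gzip:", gzDefault)
	fmt.Println("Gzip (-9):", gzBest)
}
```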

How zlib, gzip and zip are related (Mark Adler's answer)

I saw the discussion "How are zlib, gzip and zip related? (stackoverflow.com)" on Hacker News. The original is a question and answer on StackOverflow from 2013: "How are zlib, gzip and zip related? What do they have in common and how are they different?".

The short form explains how the three are related, and also brings in other related formats and algorithms:

.zip is an archive format using, usually, the Deflate compression method. The .gz gzip format is for single files, also using the Deflate compression method. Often gzip is used in combination with tar to make a compressed archive format, .tar.gz. The zlib library provides Deflate compression and decompression code for use by zip, gzip, png (which uses the zlib wrapper on deflate data), and many other applications.
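A small Go sketch can show the "same Deflate data, different wrapper" point from the short form. It compresses the same bytes three ways with the standard library: raw Deflate (`compress/flate`), the zlib wrapper (`compress/zlib`, the one PNG uses), and the gzip wrapper (`compress/gzip`). The function name `wrapperSizes` and the sample text are assumptions for illustration. With the same level and no optional gzip header fields, I would expect the only size difference to be the wrapper itself: zlib adds a 2-byte header and a 4-byte Adler-32 checksum, gzip adds a 10-byte header and an 8-byte trailer (CRC-32 plus length).

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/gzip"
	"compress/zlib"
	"fmt"
	"io"
)

// wrapperSizes compresses data at the best compression level as raw Deflate,
// zlib-wrapped Deflate, and gzip-wrapped Deflate, and returns the three sizes.
func wrapperSizes(data []byte) (raw, zl, gz int, err error) {
	var rawBuf, zlibBuf, gzipBuf bytes.Buffer

	fw, err := flate.NewWriter(&rawBuf, flate.BestCompression)
	if err != nil {
		return 0, 0, 0, err
	}
	zw, err := zlib.NewWriterLevel(&zlibBuf, zlib.BestCompression)
	if err != nil {
		return 0, 0, 0, err
	}
	gw, err := gzip.NewWriterLevel(&gzipBuf, gzip.BestCompression)
	if err != nil {
		return 0, 0, 0, err
	}

	// Feed the same input to each writer and close it to flush the stream.
	for _, w := range []io.WriteCloser{fw, zw, gw} {
		if _, err := w.Write(data); err != nil {
			return 0, 0, 0, err
		}
		if err := w.Close(); err != nil {
			return 0, 0, 0, err
		}
	}
	return rawBuf.Len(), zlibBuf.Len(), gzipBuf.Len(), nil
}

func main() {
	data := bytes.Repeat([]byte("How are zlib, gzip and zip related?\n"), 100)
	raw, zl, gz, err := wrapperSizes(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("raw deflate:", raw)
	fmt.Println("zlib:", zl, "(wrapper overhead:", zl-raw, "bytes)")
	fmt.Println("gzip:", gz, "(wrapper overhead:", gz-raw, "bytes)")
}
```

This also hints at why the PNG and gzip numbers in the previous post land so close together: both end up storing a Deflate stream, just with different framing around it.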

The long form lays out the whole history, so thoroughly that a commenter below asked for some links to cite as references:

This post is packed with so much history and information that I feel like some citations need be added incase people try to reference this post as an information source. Though if this information is reflected somewhere with citations like Wikipedia, a link to such similar cited work would be appreciated.

The author of the answer, Mark Adler, replied in the comments with real swagger:

I am the reference, having been part of all of that. This post could be cited in Wikipedia as an original source.

I looked up his track record:

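The old-timer laid out some of the context from back then (such as the patent issues), which makes it much easier to understand why things ended up the way they are now...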