While looking something up, I noticed that gzip has a --rsyncable option, which is said to produce compressed files that are friendly to rsync:
When you synchronize a compressed file between two computers, this option allows rsync to transfer only files that were changed in the archive instead of the entire archive. Normally, after a change is made to any file in the archive, the compression algorithm can generate a new version of the archive that does not match the previous version of the archive. In this case, rsync transfers the entire new version of the archive to the remote computer. With this option, rsync can transfer only the changed files as well as a small amount of metadata that is required to update the archive structure in the area that was changed.
For an explanation of this option, see the post "Rsyncable gzip". It was published in 2005, which shows how long the option has been around:
With this option, gzip will regularly “reset” his compression algorithm to what it was at the beginning of the file. So if for example there was a change at byte 23, this change will only affect the output up to maximum (for example) byte #9999. Then gzip will restart ‘at zero’, and the rest of the compressed output will be the same as what it was without the changed byte 23. This means that rsync will now be able to re-synchronise between the old and new compressed file, and can then avoid sending the portions of the file that were unmodified.
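The idea is this: under normal operation, a tiny change in gzip's input can make all of the compressed output after that point completely different. With --rsyncable, gzip periodically resets its compression state, so most of the compressed output stays identical from one version to the next, and rsync can detect the matching parts and avoid retransmitting them.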
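To make the reset idea concrete, here is a minimal Python sketch. It is not how gzip implements --rsyncable: it assumes a fixed, hypothetical reset interval (`BLOCK`) and uses zlib's `Z_FULL_FLUSH` on a raw deflate stream as a stand-in for the reset. Fixed intervals only line up again when the change does not shift later bytes, so this only illustrates a same-length modification like the "byte 23" example in the quote.

```python
import zlib

BLOCK = 8192  # hypothetical reset interval, chosen only for this illustration


def compress_with_resets(data: bytes, block: int = BLOCK) -> list:
    """Raw-deflate `data`, resetting the compressor state after every block."""
    # wbits=-15 selects raw deflate: no header and no trailing checksum.
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    segments = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        # Z_FULL_FLUSH byte-aligns the output and discards the match history,
        # so the next block is compressed without reference to earlier input.
        segments.append(comp.compress(chunk) + comp.flush(zlib.Z_FULL_FLUSH))
    segments.append(comp.flush())  # finish the deflate stream
    return segments


# Synthetic test data standing in for a real backup file.
old = b"".join(b"line %d: payload %d\n" % (i, i * i) for i in range(20000))
changed = bytearray(old)
changed[23] ^= 0xFF  # modify a single byte near the start
new = bytes(changed)

old_segments = compress_with_resets(old)
new_segments = compress_with_resets(new)

# Compare the compressed output block by block.
same = [a == b for a, b in zip(old_segments, new_segments)]
print("segments:", len(same), "identical:", sum(same))
```

Only the segment containing the modified byte should differ; everything after the first reset comes out byte-for-byte identical, which is exactly what lets rsync skip it.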
The downside is just as easy to see: this gives up some of the compression algorithm's efficiency, and the price is a slightly larger output file.
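I tested this on a backup of a BBS using zstd -19 (the highest zstd level available without --ultra). Plain compression produced 513672097 bytes, while adding --rsyncable produced 513831761 bytes. That is an increase of only a few parts per ten thousand, which is basically a rounding error...?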
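For reference, the overhead from the two sizes above works out like this (a quick calculation using only the numbers from my test; the variable names are just for illustration):

```python
plain = 513_672_097      # bytes, zstd -19
rsyncable = 513_831_761  # bytes, zstd -19 --rsyncable

extra = rsyncable - plain  # additional bytes caused by --rsyncable
ratio = extra / plain      # relative size increase

print(f"{extra} extra bytes, {ratio:.4%} larger")
```

That is roughly 156 KiB extra on a file of about 490 MiB, or about 0.03%.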
It looks like a pretty handy option, so I'm writing this post to keep a note of it...