> It is then further optimized with wasm-opt -Oz --enable-bulk-memory bringing the total down to 2.4 MiB. Finally, it is compressed with zstd, bringing the total down to 637 KB. This is offset by the size of the zstd decoder implementation in C, however it is worth it because the zstd implementation will change rarely if ever, saving a total of 1.8 MiB every time the wasm binary is updated.
Is the goal here to save space in the Git repo, by compressing before committing?
I wouldn't assume using zstd is necessarily worth the complication. It could even make things worse.
As I understand it, Git stores objects in packfiles[1], and these are both delta-fied and compressed with zlib.
Your zstd step reduces the 2.4 MiB .wasm file to 637K. But Git's zlib should reduce the same 2.4 MiB to about 800K (according to a quick test I just did). So at best, you only save 163K per committed version, not 1.8 MiB.
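A minimal sketch of that kind of quick test, in Python. It assumes Git is left at zlib's default compression level (the level is configurable in Git), and the helper names zlib_size and extra_saving are made up for illustration. The sample data is synthetic; for a real measurement, pass in the bytes of zig1.wasm instead.

```python
import zlib


def zlib_size(data: bytes, level: int = -1) -> int:
    """Size in bytes of `data` after one-shot zlib compression.

    level=-1 selects zlib's built-in default level. That Git uses this
    level for its objects is an assumption (Git lets you change it).
    """
    return len(zlib.compress(data, level))


def extra_saving(zlib_size_k: int, precompressed_size_k: int) -> int:
    """How much smaller the pre-compressed blob is than the zlib'd raw file."""
    return zlib_size_k - precompressed_size_k


# Synthetic stand-in for a real file, only here so the sketch runs as-is.
sample = b"local.get 0 i32.const 1 i32.add " * 4096
print(len(sample), "->", zlib_size(sample))

# Figures quoted in this thread, in K: roughly 800 after zlib, 637 after zstd.
print(extra_saving(800, 637))  # 163
```

The point of the second function is only that the baseline for "space saved" is the zlib-compressed size, not the raw size.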
But if Git's delta-fication works, you may actually use more space.
Git should try to use its binary diff algorithm[2] to compare your various committed versions of zig1.wasm. If that algorithm is effective against Wasm files (my guess is yes), it will be able to store one version as a full copy and other versions as (somewhat? much?) smaller deltas against the full one.
If you store .wasm.zst files, since compression tends to obscure commonalities, my guess is Git won't be able to do deltas and will have to store full copies of every version.
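Here is a rough way to see the "compression obscures commonalities" effect without Git at all. Everything in it is a stand-in: zlib plays the role of zstd, and "compress the new version with the old version as a zlib preset dictionary" is only a crude proxy for Git's delta encoding, not its actual algorithm. The two "versions" are synthetic data with ten small edits between them, and the function names are hypothetical.

```python
import random
import zlib


def make_version(seed: int = 0, size: int = 20000) -> bytes:
    """Deterministic, compressible pseudo-file built from a 200-token vocabulary."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(6)) for _ in range(200)]
    text = " ".join(rng.choice(vocab) for _ in range(size // 6))
    return text.encode("ascii")[:size]


def patch(data: bytes, edits: int = 10) -> bytes:
    """Return a new 'version' with a few small scattered edits."""
    out = bytearray(data)
    step = len(out) // (edits + 1)
    for i in range(edits, 0, -1):  # back to front so earlier offsets stay valid
        pos = i * step
        out[pos:pos + 4] = b"PATCH%02d!" % i  # replace 4 bytes with 8
    return bytes(out)


def delta_cost(base: bytes, target: bytes) -> int:
    """Bytes needed to encode `target` when `base` is available as a zlib dictionary.

    This is a proxy for a delta size, NOT Git's delta algorithm. The inputs
    are kept under 32 KiB so the whole base fits in zlib's window.
    """
    comp = zlib.compressobj(level=9, zdict=base)
    return len(comp.compress(target) + comp.flush())


v1 = make_version()
v2 = patch(v1)

raw_delta = delta_cost(v1, v2)                                # versions stored raw
packed_delta = delta_cost(zlib.compress(v1), zlib.compress(v2))  # stored pre-compressed

print("delta between raw versions:           ", raw_delta)
print("delta between pre-compressed versions:", packed_delta)
```

With raw inputs the second version is almost entirely long back-references into the first. Once both are compressed, the byte streams diverge after the first edit and there is very little left to share.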
On a side note, Git is said to be bad at handling binaries, and that's somewhat true, but there's some nuance. Binary files get in the way of easy branching and merging because Git can't merge them. So Git is bad at binary files in that way, but that's not relevant here. Also a lot of binary formats (like JPEG) are very much not amenable to binary diff, but my bet is that's not relevant here either.
As for the size of the release tarballs: zstd actually makes that worse too, somewhat paradoxically, at least in this case, because Zig uses xz for its tarballs. (If they used gzip, it would be the other way around.)
The reason is that compression algorithms usually can't make further reductions when re-compressing already-compressed files. And xz has a higher compression ratio than zstd, so when you stick zig1.wasm.zst into a tar.xz file, xz is deprived of the opportunity to work its more powerful magic.
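The same effect can be sketched with nothing but the Python standard library. Assumptions, stated loudly: zlib stands in for zstd as the "weaker inner compressor", lzma.compress stands in for the xz step, and the payload is synthetic, deliberately built from several near-identical copies that sit farther apart than zlib's 32 KiB window. It illustrates the mechanism and says nothing about the actual numbers for zig1.wasm.

```python
import lzma
import random
import zlib


def make_versions(seed: int = 0, copies: int = 5, words_per_copy: int = 6000) -> bytes:
    """Synthetic payload: near-identical ~42 KB copies, each with 20 small edits."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(6)) for _ in range(200)]
    base = [rng.choice(vocab) for _ in range(words_per_copy)]
    chunks = []
    for _ in range(copies):
        words = list(base)
        for _ in range(20):
            words[rng.randrange(len(words))] = "EDITED"
        chunks.append(" ".join(words))
    return "\n".join(chunks).encode("ascii")


payload = make_versions()

pre = zlib.compress(payload, 9)        # stand-in for the zstd step
direct = len(lzma.compress(payload))   # raw data handed straight to xz
nested = len(lzma.compress(pre))       # pre-compressed data handed to xz

print("raw payload:          ", len(payload))
print("inner compressor only:", len(pre))
print("xz of raw payload:    ", direct)
print("xz of pre-compressed: ", nested)
```

The outer compressor gets very little out of the pre-compressed input, so the final size is dominated by whatever the inner compressor managed on its own.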
As a test, I got zig-0.11.0-dev.638+5c67f9ce7.tar.xz from https://ziglang.org/download/ , extracted it, and rebuilt the tar.xz myself. Then I replaced stage1/zig1.wasm.zst with stage1/zig1.wasm and rebuilt the tar.xz again.
So, zig.orig.tar is the uncompressed tarball that contains zig1.wasm.zst, and it is indeed smaller than zig.new.tar. But the .tar.xz files are the other way around.
Not using zstd saves 68K.
=-=-=
Also, in the process, I accidentally discovered something else that makes a bigger difference.
Since I knew the order of files within a tar archive can affect the compression ratio (due to data locality), while doing my test, I used "tar tf" to list my tar file's contents and compare it with what I downloaded. It didn't match, so I knew I wasn't doing an apples to apples comparison.
So I added "--sort=name" to my tar commands. And both of my tar files ended up smaller than the one I downloaded:
$ du -sk zig-0.11.0-dev.638+5c67f9ce7.tar.xz
15152 zig-0.11.0-dev.638+5c67f9ce7.tar.xz
Just adding the "--sort=name" option to tar saves 584K! That's around 4% of the entire tar file. Locality matters more than I thought.
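If you want to poke at the ordering effect without GNU tar, here is a small Python sketch that builds the same set of files into a tar in two different orders and xz-compresses each. The file names and contents are invented for the example, and sorting full paths with sorted() is only an approximation of what tar's "--sort=name" does. Whether the sorted archive comes out smaller depends entirely on the data, so the sketch prints both sizes rather than promising a winner.

```python
import io
import lzma
import tarfile
from typing import Dict, List


def build_tar(files: Dict[str, bytes], sort_names: bool) -> bytes:
    """Build an uncompressed tar in memory, optionally ordering members by name."""
    names = sorted(files) if sort_names else list(files)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def member_names(tar_bytes: bytes) -> List[str]:
    """List member names in archive order (what "tar tf" would show)."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.getnames()


def xz_size(tar_bytes: bytes) -> int:
    """Size of the tar after xz compression at lzma's default preset."""
    return len(lzma.compress(tar_bytes))


# Hypothetical file set, inserted in a deliberately shuffled order.
files = {
    "src/b.zig": b"pub fn b() void {}\n" * 200,
    "doc/a.md": b"# Notes on a\n" * 200,
    "src/a.zig": b"pub fn a() void {}\n" * 200,
    "doc/b.md": b"# Notes on b\n" * 200,
}

unsorted_tar = build_tar(files, sort_names=False)
sorted_tar = build_tar(files, sort_names=True)

print(member_names(unsorted_tar), xz_size(unsorted_tar))
print(member_names(sorted_tar), xz_size(sorted_tar))
```

Comparing member_names() of the two archives is the same sanity check as diffing "tar tf" output: if the orders differ, the size comparison isn't apples to apples.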
---
[1] See:
https://git-scm.com/docs/git-pack-objects
https://git-scm.com/docs/pack-format
https://git-scm.com/book/en/v2/Git-Internals-Packfiles
[2] "inspired by" LibXDiff, according to https://github.com/git/git/blob/master/diff-delta.c