
For the Wayback Machine, are those compressed, deduplicated numbers? A semi-popular domain can have millions of results in its CDX API, but with http/https duplicates, and about 90% of results being error pages or pages containing deliberate garbage / LFI attempts.
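For reference, the CDX server itself can do some of this filtering at query time. A minimal sketch of building such a query, assuming the public CDX endpoint and its documented `collapse` and `filter` parameters (`example.com` is a placeholder domain):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str) -> str:
    """Build a CDX query that skips error pages and collapses duplicates."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains
        "output": "json",
        "collapse": "digest",        # drop consecutive captures with identical content
        "filter": "statuscode:200",  # skip error pages
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

print(cdx_query_url("example.com"))
```

Note that `collapse=digest` only collapses *adjacent* captures with the same digest, so it doesn't fully dedupe, and it doesn't merge http/https variants.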


Deduplication is not trivial. Each scrape is stored in a WARC archive, so you would have to unpack several large files, dedupe, and then repack them. I believe the data is at least compressed within each snapshot, though.
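The WARC format does have a hook for this: each response record carries a WARC-Payload-Digest (a SHA-1 of the body), and a crawler that has already stored that digest can write a lightweight "revisit" record instead of the body. A minimal sketch of the idea, with plain byte strings standing in for payloads pulled out of WARC records:

```python
import hashlib

def dedupe_payloads(payloads):
    """Keep one copy per SHA-1 digest; count bodies that would become revisit records."""
    seen = {}
    revisits = 0
    for body in payloads:
        digest = hashlib.sha1(body).hexdigest()
        if digest in seen:
            revisits += 1  # would be written as a revisit record, not stored again
        else:
            seen[digest] = body
    return list(seen.values()), revisits

bodies = [b"<html>404</html>", b"<html>hi</html>", b"<html>404</html>"]
unique, revisits = dedupe_payloads(bodies)
# two unique bodies kept, one duplicate replaced by a revisit
```

This only helps going forward, though; retroactively deduping petabytes of already-written WARCs still means the unpack/repack problem above.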


Yes, that seems a silly way to go about it if your goal is to store the whole web rather than a single scrape. Of course, anything that deduplicates data is more vulnerable to data corruption (or at least corruption can have wider consequences), so it's not a trivial problem, but you'd think deduplicating identical resources would have been added the first time they came close to their storage limits.



