For wayback machine, are those compressed, deduplicated numbers?
A semi-popular domain can have millions of results in its CDX API, but with http/https duplicates, and roughly 90% of the results are error pages or pages containing deliberate garbage / LFI attempts.
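For what it's worth, the CDX API does expose server-side filters that cut a lot of that noise before you download anything. A minimal sketch that just builds the query URL (parameter names as I understand the public CDX API; the domain is hypothetical):

```python
from urllib.parse import urlencode

def cdx_query_url(domain):
    """Build a Wayback CDX API query that trims duplicate/error captures.

    filter=statuscode:200 skips error pages, and collapse=digest drops
    consecutive captures whose content digest is identical.
    """
    params = urlencode({
        "url": domain,
        "matchType": "domain",       # include subdomains
        "filter": "statuscode:200",  # skip error pages
        "collapse": "digest",        # drop consecutive identical captures
        "output": "json",
    })
    return "https://web.archive.org/cdx/search/cdx?" + params

print(cdx_query_url("example.com"))
```

That still won't merge http/https variants or non-consecutive duplicates, but it shrinks the result set a lot.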
Deduplication is not trivial. Each scrape is stored in a WARC archive, so you would have to unpack several large files, dedupe, and then repack them. I believe the records are at least compressed within each snapshot, though.
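The WARC format actually anticipates this: it has "revisit" records that replace a duplicate capture with a pointer keyed on the payload's content digest. A toy illustration of that idea in plain Python (no real WARC parsing, and the function/variable names are mine):

```python
import hashlib

def dedupe_payloads(payloads):
    """Digest-based dedup in the spirit of WARC revisit records:
    store each unique payload once; record repeats as a reference
    to the digest of the already-stored copy.
    """
    stored = {}   # digest -> payload bytes (the single stored copy)
    index = []    # per capture: (digest, is_duplicate)
    for payload in payloads:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in stored:
            index.append((digest, True))    # "revisit": point at stored copy
        else:
            stored[digest] = payload
            index.append((digest, False))   # first capture: store it
    return stored, index

captures = [b"<html>hi</html>", b"<html>hi</html>", b"<html>new</html>"]
stored, index = dedupe_payloads(captures)
print(len(stored))  # 2 unique payloads stored for 3 captures
```

The hard part isn't the hashing, it's doing this retroactively across petabytes of already-packed compressed archives.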
Yes, that seems a silly way to go about it if your goal is to store the whole web and not just a single scrape. Of course, anything that deduplicates data is more vulnerable to data corruption (or at least corruption can have wider consequences), so it's not a trivial problem, but you'd think deduplicating identical resources would have been added the first time they came close to their storage limits.