
For the Wayback Machine, are those compressed, deduplicated numbers? A semi-popular domain can have millions of results in its CDX API, but with http/https duplicates, and about 90% of results being error pages or pages containing deliberate garbage / LFI attempts.
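For reference, the CDX server itself can do some of this filtering at query time. A minimal sketch of building such a query, assuming the public CDX endpoint and its documented `collapse` and `filter` parameters (`example.com` is a placeholder domain):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(domain: str) -> str:
    """Build a CDX query that skips error pages and collapses duplicates."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains
        "output": "json",
        "collapse": "digest",        # drop consecutive captures with identical content
        "filter": "statuscode:200",  # skip error pages
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

print(cdx_query_url("example.com"))
```

Note that `collapse=digest` only collapses *adjacent* captures with the same digest, so it doesn't fully dedupe, and it doesn't merge http/https variants.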


Deduplication is not trivial. Each scrape is stored in a WARC archive, so you would have to unpack several large files, dedupe, and then repack them. I believe the data is at least compressed within each snapshot, though.
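The WARC format does have a hook for this: each response record carries a WARC-Payload-Digest (a SHA-1 of the body), and a crawler that has already stored that digest can write a lightweight "revisit" record instead of the body. A minimal sketch of the idea, with plain byte strings standing in for payloads pulled out of WARC records:

```python
import hashlib

def dedupe_payloads(payloads):
    """Keep one copy per SHA-1 digest; count bodies that would become revisit records."""
    seen = {}
    revisits = 0
    for body in payloads:
        digest = hashlib.sha1(body).hexdigest()
        if digest in seen:
            revisits += 1  # would be written as a revisit record, not stored again
        else:
            seen[digest] = body
    return list(seen.values()), revisits

bodies = [b"<html>404</html>", b"<html>hi</html>", b"<html>404</html>"]
unique, revisits = dedupe_payloads(bodies)
# two unique bodies kept, one duplicate replaced by a revisit
```

This only helps going forward, though; retroactively deduping petabytes of already-written WARCs still means the unpack/repack problem above.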


Yes, that seems a silly way to go about it if your goal is to store the whole web rather than a single scrape. Of course, anything that deduplicates data is more vulnerable to data corruption (or at least corruption can have wider consequences), so it's not a trivial problem, but you'd think deduplicating identical resources would have been added the first time they came close to their storage limits.



