
Would you recommend it for scalable projects? Like crawling Twitter or Tumblr?


Yes. It beats building your own crawler that handles all the edge cases. That said, before you reach the limits of scrapy, you will more likely be restricted by preventive measures put in place by Twitter (or any other large website) to keep any one user from hogging too many resources. Services like Cloudflare are aware of all the usual proxy servers and will immediately block such requests.


So how do you do it? Do you have to become Google/Bing?


One approach that is commonly mentioned in this thread is to simulate the behavior of a normal user as much as possible, for instance by rendering the full page (including JS, CSS, ...), which is far more resource-intensive than just downloading the raw HTML.
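
As a rough sketch of that approach: scrapy can hand page loads off to a headless browser. The snippet below assumes the scrapy-playwright plugin (the handler class path and the "playwright" meta key are that plugin's convention, not something from the comment above):

    # settings.py -- route downloads through a real browser (scrapy-playwright)
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    # spider.py
    import scrapy

    class RenderedSpider(scrapy.Spider):
        name = "rendered"

        def start_requests(self):
            # "playwright": True loads the page in a headless browser,
            # executing JS/CSS the way a normal user's browser would.
            yield scrapy.Request("https://example.com/",
                                 meta={"playwright": True})

        def parse(self, response):
            yield {"title": response.css("title::text").get()}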

However, if you're crawling big platforms, there are often ways in that scale and stay undetected for very long periods of time. These include forgotten API endpoints that were built for some new application that was later dropped, mobile interfaces that tap into different endpoints, and obscure platform-specific applications (e.g. PlayStation or some old version of Android). The older and larger the platform, the more probable it is that it has many entry points it doesn't police at all, or only very lightly.
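
For illustration only: such an entry point often returns JSON directly, which is far cheaper to fetch and parse than the rendered site. The endpoint URL and field names here are entirely hypothetical:

    import json
    import scrapy

    class MobileApiSpider(scrapy.Spider):
        name = "mobile_api"
        # hypothetical mobile endpoint serving JSON instead of the HTML site
        start_urls = ["https://m.example.com/api/v1/posts?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            for post in data.get("posts", []):
                yield {"id": post.get("id"), "text": post.get("text")}
            # follow pagination if the (hypothetical) API exposes a next link
            if data.get("next"):
                yield scrapy.Request(response.urljoin(data["next"]),
                                     callback=self.parse)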

One of the most important rules of scraping is to be patient. Everyone is anxious to get going as soon as they can, but once you start pounding on a website and draining its resources, it will take measures against you and the whole task will get far more complicated. If you have the patience and make sure you're staying within some limits (hard to guess from the outside), you will eventually be able to amass large datasets.
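
scrapy's built-in throttling covers a lot of that patience for you. A minimal sketch; the numbers are placeholders to tune per site, not values from the comment above:

    # settings.py -- be patient by default
    AUTOTHROTTLE_ENABLED = True            # adapt delays to the server's observed latency
    AUTOTHROTTLE_START_DELAY = 5.0         # initial delay between requests, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # back off hard if the site slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight at a time
    CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never hammer a single domain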


some "ethical" measures may do the trick to. scrapy has a setting to integrate delays + you can use fake headers. Some sites are pretty persistent with their cookies (include cookies in requests). It's all case by case basis


I just spawned around 20 servers on AWS for a couple of days, but that was for a one-off scrape of some 4 million pages.


I've used it for some larger scrapes (nothing at the scale you're talking about, but still sizeable), and scrapy has very tight integration with scrapinghub.com to handle all of the deployment issues (worker uptime, result storage, rate-limiting, etc.). Not affiliated with them in any way; I've just had a good experience using them in the past.


Every `hosted/cloud/SaaS/PaaS` offering runs into bazillions of $$$ for anything large-scale, starting with AWS bandwidth and including nearly every service on this earth.


I would hazard a guess that nearly all large scale use cases are negotiating those prices down quite a bit.



