
Would you recommend it for scalable projects? Like crawling Twitter or Tumblr?


Yes. It beats building your own crawler that handles all the edge cases. That said, before you reach the limits of scrapy, you will more likely be restricted by preventive measures put in place by Twitter (or any other large website) to keep any one user from hogging too many resources. Services like Cloudflare are aware of all the usual proxy servers and will immediately block such requests.


So how do you do it? Do you have to become Google/Bing?


One approach that is commonly mentioned in this thread is to simulate the behavior of a normal user as much as possible, for instance by rendering the full page (including JS, CSS, ...), which is far more resource-intensive than just downloading the raw HTML.
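
As a rough sketch of that approach: scrapy can hand page loads off to a headless browser. The snippet below assumes the scrapy-playwright plugin (the handler class path and the "playwright" meta key are that plugin's convention, not something from the comment above):

    # settings.py -- route downloads through a real browser (scrapy-playwright)
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    # spider.py
    import scrapy

    class RenderedSpider(scrapy.Spider):
        name = "rendered"

        def start_requests(self):
            # "playwright": True loads the page in a headless browser,
            # executing JS/CSS the way a normal user's browser would.
            yield scrapy.Request("https://example.com/",
                                 meta={"playwright": True})

        def parse(self, response):
            yield {"title": response.css("title::text").get()}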

However, if you're crawling big platforms, there are often ways in that scale and stay undetected for very long periods of time. These include forgotten API endpoints that were built for some new application that was later dropped, mobile interfaces that tap into different endpoints, and obscure platform-specific applications (e.g. PlayStation or some old version of Android). The older and larger the platform, the more probable it is that it has many entry points it doesn't police at all, or only very lightly.
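
For illustration only: such an entry point often returns JSON directly, which is far cheaper to fetch and parse than the rendered site. The endpoint URL and field names here are entirely hypothetical:

    import json
    import scrapy

    class MobileApiSpider(scrapy.Spider):
        name = "mobile_api"
        # hypothetical mobile endpoint serving JSON instead of the HTML site
        start_urls = ["https://m.example.com/api/v1/posts?page=1"]

        def parse(self, response):
            data = json.loads(response.text)
            for post in data.get("posts", []):
                yield {"id": post.get("id"), "text": post.get("text")}
            # follow pagination if the (hypothetical) API exposes a next link
            if data.get("next"):
                yield scrapy.Request(response.urljoin(data["next"]),
                                     callback=self.parse)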

One of the most important rules of scraping is to be patient. Everyone is anxious to get going as soon as they can, but once you start pounding on a website and draining its resources, it will take measures against you and the whole task will get far more complicated. If you have the patience and make sure you're staying within some limits (hard to guess from the outside), you will eventually be able to amass large datasets.
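
scrapy's built-in throttling covers a lot of that patience for you. A minimal sketch; the numbers are placeholders to tune per site, not values from the comment above:

    # settings.py -- be patient by default
    AUTOTHROTTLE_ENABLED = True            # adapt delays to the server's observed latency
    AUTOTHROTTLE_START_DELAY = 5.0         # initial delay between requests, in seconds
    AUTOTHROTTLE_MAX_DELAY = 60.0          # back off hard if the site slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for roughly one request in flight at a time
    CONCURRENT_REQUESTS_PER_DOMAIN = 1     # never hammer a single domain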


some "ethical" measures may do the trick to. scrapy has a setting to integrate delays + you can use fake headers. Some sites are pretty persistent with their cookies (include cookies in requests). It's all case by case basis


I just spawned around 20 servers on AWS for a couple of days, but that was for a one-off scrape of some 4 million pages.


I've used it for some larger scrapes (nothing at the scale you're talking about, but still sizeable), and scrapy has very tight integration with scrapinghub.com to handle all of the deployment issues (worker uptime, result storage, rate-limiting, etc.). Not affiliated with them in any way; I've just had a good experience using them in the past.


Every `hosted/cloud/SaaS/PaaS` offering runs into bazillions of $$$ for anything large-scale, starting with AWS bandwidth and including nearly every service on this earth.


I would hazard a guess that nearly all large scale use cases are negotiating those prices down quite a bit.



