Amazon EC2, MongoDB, S3. The number of EC2 instances scales with how many stale feeds we have, but it is usually fewer than 2.
I just checked, and we have ~25k feeds in the system, though not all of them are "deep harvesting", as we call it.
Note that we do a few things beyond just extracting the full content: we also try to pull out images and create a pleasing thumbnail using face detection, etc. That probably slows things down a good deal as well.
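To give a rough idea of the thumbnail step, here is a minimal sketch of a face-aware crop. It is an illustration only, not our actual code: the function name `face_centered_crop`, the `(x, y, w, h)` box format, and the idea of centering the crop on the face are all assumptions, and the face box is assumed to come from some external detector.

```python
def face_centered_crop(img_w, img_h, face_box=None, thumb_aspect=1.0):
    """Return (left, top, right, bottom) for a thumbnail crop.

    Hypothetical helper: face_box is (x, y, w, h) in pixels from any face
    detector, or None if no face was found. thumb_aspect is width / height.
    """
    # Largest crop of the requested aspect ratio that fits inside the image.
    crop_w = min(img_w, int(img_h * thumb_aspect))
    crop_h = min(img_h, int(crop_w / thumb_aspect))

    # Center on the face if there is one, otherwise on the image center.
    if face_box is None:
        cx, cy = img_w / 2, img_h / 2
    else:
        x, y, w, h = face_box
        cx, cy = x + w / 2, y + h / 2

    # Clamp so the crop never extends past the image bounds.
    left = max(0, min(int(round(cx - crop_w / 2)), img_w - crop_w))
    top = max(0, min(int(round(cy - crop_h / 2)), img_h - crop_h))
    return (left, top, left + crop_w, top + crop_h)
```

The returned box can be handed to whatever image library does the actual cropping and resizing. The detection itself is likely the expensive part, which fits with the slowdown mentioned above.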