Full Text RSS Feed: Get the whole feed and nothing but the feed

nicpottier · on March 2, 2011

We built a backend similar to this for our NewsRoom mobile client. (Android and Pre) Actually used some genetic algorithms to do the training for our content extraction, one of the more fun projects I've done.

Word of warning, if it takes off, you basically start turning into someone who is both caching and harvesting the web every 15 minutes. There is an incredibly long tail on RSS feeds and it starts killing you to keep them all up to date. Storing and serving it is no big deal, but harvesting actually turns into real money when you figure out total bandwidth used. (we harvest about ~30,000 feeds every 15 minutes)

tianyicui · on March 2, 2011

I really wish one day Google will sell a page harvesting service. It can certainly profit since the cost is neglectable. But how big the market is?

lurchpop · on March 2, 2011

how do you handle scale like that? Do you have a hadoop cluster or something? How many concurrent do you download?

nicpottier · on March 2, 2011

Amazon EC2, MongoDB, S3. The EC2 instances scale with how many stale feeds we have, but it is usually less than 2.

Just checked and we have ~25k feeds in the system, though not all are deep harvesting as we call it.

Note we do a few things over just extracting the full content as well, we also try to grab out images and create a pleasing thumbnail using face detection etc.. So that probably slows things down a good deal as well.

geuis · on March 1, 2011

Could you darken up on the grey a bit? Grey on white on grey isn't exactly easy to read.

jonkelly · on March 2, 2011

Could be a way to slow down the feed owners' lawyers a bit? ;-)

timrosenblatt · on March 2, 2011

lol.

What's the difference between a Cease & Desist letter from a big company, and free advertising for a startup?

aaroneous · on March 2, 2011

Is that better?

geuis · on March 2, 2011

Gratzi! Much easier to read now.

timrosenblatt · on March 2, 2011

how's it look now?

grayrest · on March 1, 2011

What's he using to pull out the articles? I had a hacky version set up using the Readability algorithm but never bothered to make it public.

yesimahuman · on March 2, 2011

Boilerpipe is by far the best tool for this that I've ever found (http://code.google.com/p/boilerpipe/). I'd be interested to hear if he is using something better, but I'd be surprised if he is.

I think this is a great idea and very similar to a lot of stuff I have worked on recently. It's cool to see so much interest in these text-related services.

nikcub · on March 2, 2011

Thanks for that link - exactly what I was looking for

btw I know that at Techmeme, Gabe spent years perfecting his story parsing for the 50k+ sites he tracks. Even something that would seem simple such as parsing the date of a story from a webpage has a ridiculous number of permutations that you have to grep for.

benmccann · on March 2, 2011

I don't think it's quite as good as what he's doing though. He has the title and date specifically pulled out and he doesn't have any extra text included. I think he manually handles CNN. If I try a HuffPost feed it doesn't work at all.

yesimahuman · on March 2, 2011

Yea, I'd be curious to see exactly what he's doing. I can only guess there is a heuristic which results in a lot of failed feed processing noticed on here (I know it's just a weekend project :)) that doesn't generalize well. Boilerpipe, in my experience, works very well on almost all news/blog type content. Finding the date in the first few sentences and the title are extra heuristics that can be added later.

EDIT: The date and title are in the RSS feed already! No further analysis needed.

osmandros · on March 2, 2011

That library is very robust. On the company I was working last year, We built a news crawler using that tool, and adapted it as a plugin for nutch.

gregorym · on March 2, 2011

That is freaking awesome! http://boilerpipe-web.appspot.com/

beagledude · on March 2, 2011

Goose article extractor has a full suite of unit tests and also does pure text and image extractions: https://github.com/jiminoc/goose

k1m · on March 3, 2011

Possible to do it with Readability - my PHP port is here: http://www.keyvan.net/2010/08/php-readability/ - and similar tool (free software, can be self-hosted) using the PHP Readability here: http://fivefilters.org/content-only/

spidaman · on March 1, 2011

Does this work for anybody? I've plugged in 3 feeds, one was "unable to retrieve full-text content" (an sfgate.com feed) and the other two returned nothing at all in the preview (one a feed from kqed.org, the other an older wordpress blog).

guptaneil · on March 2, 2011

The preview for Lifehacker returned nothing at all, but adding the feed to Google Reader worked as advertised. I guess, don't rely on the preview box.

matsur · on March 2, 2011

This may be common knowledge, but all gawker blogs are available in full feed, ad free form at <gawker entity>.com/vip.xml

i.e. http://lifehacker.com/vip.xml

yahelc · on March 2, 2011

Ah, I see I'm not the first to point this out :)

ericgs · on March 1, 2011

Funny, I just added sfgate too and it worked for me: fulltextrssfeed.com/www.sfgate.com/rss/feeds/news.xml

spidaman · on March 2, 2011

Hrm, that one loads fine... but fulltextrssfeed.com/www.sfgate.com/rss/feeds/blogs/sfgate/offtherecord/index_rss2.xml fails with "unable to retrieve full-text content" and fulltextrssfeed.com/www.kqed.org/rss/arts.xml fails with "Unable to parse this page for content."

I have a lot of experience with fetching and parsing feeds and pages, so I'm not trivializing the problem, just observing issues I'm seeing with this solution.

timrosenblatt · on March 1, 2011

works for me. did you test out the default CNN feed? does that work?

clvv · on March 2, 2011

A similar service: http://fivefilters.org/content-only/ and it is opensource too. It uses a PHP version of readability to extract the full content. Also can the author of fulltextrssfeed.com explain some of the implementation details? I was planning on a similar project with node.js, jsdom and readability.

hokkos · on March 2, 2011

Is it legal ? Can you legally copy all the content of a site and publish it while striping the ads ?

I've tough of this idea since 2 years, but I am so ineffective at building my own ideas that it doesn't surprise me that someone else built it, as the idea was really floating more and more since instapaper mobilizer.

Considering the legal aspect I had more ideas about that. It is to hide behind the DMCA takedown, and provide an email address to take-down a feed. But do not map the www.example.com/feed.xml to http://fulltextrssfeed.com/www.example.com/feed.xml , but use an alias, so the take-down just remove the alias not the whole * .example.com*.

swombat · on March 2, 2011

Immediate swap of current PG essays feed for:

http://fulltextrssfeed.com/www.aaronsw.com/2002/feeds/pgessa...

yahelc · on March 2, 2011

Considering the impending lawyer-takedown, it would be great if this was made open source, so people can implement their own local versions on their own servers.

yagibear · on March 1, 2011

Could you also do the opposite: Take bulky feeds (e.g. http://feeds.feedburner.com/tedblog) and truncate them; showing title & first para & include a link? I use RSS primarily to scan what is available and mark some for later reading, and bulky feeds interrupt the scanning process.

tshaddox · on March 2, 2011

This would probably be trivial with Yahoo! Pipes.

http://pipes.yahoo.com/pipes/

timrosenblatt · on March 2, 2011

there's a commenter further down the page by the handle of "Roll" that says Pipes didn't work for him.

And yeah, it was built in a weekend :) But now you don't have to.

timrosenblatt · on March 1, 2011

that's a good idea. I will bring it up to him.

cvandyck76 · on March 2, 2011

Is there an argument to be made that the content providers only get 'paid' if the RSS reader is enticed to click through to the site? I'm all for neat services, but I think that this is a little bit unfair to the other party.

tuhin · on March 2, 2011

Not trying to be the show stopper here, but this is illegal right? I mean especially news sites like Reuters do create a fuss when this is done. Is that (legal drama) only in commercial projects or otherwise too?

netmau5 · on March 1, 2011

Nice, this will come in very useful for an RSS-based project I'm working on too. Hopefully I won't slam your servers too hard. Are you considering making the source available?

aaroneous · on March 2, 2011

I wasn't expecting much interest in it, but I'd be happy to clean it and package it up if you guys want to play.

shadowpwner · on March 2, 2011

Yes please.

timrosenblatt · on March 1, 2011

it's not mine, it's a project a friend threw together over the weekend. It's on a shared host, but I'm trying to help light the server on fire so he puts it on something more heavy duty. :)

pak · on March 2, 2011

This is nice but what's the difference from ViewText (http://www.viewtext.org)? ViewText has a JSONP API, which made it perfect for building into a recent little project I did (it was a web app). Plus, it's been around for a lot longer.

ericgs · on March 2, 2011

Just tried viewtext on lifehacker's feed and got:

"We understand you'd like to delete your account. If you delete your account all of your information including your comments, messages, posts, and friends and followers associations will be removed from our system. Please consider the following options before clicking delete."

Yikes! =X

ianvanness · on March 2, 2011

Give the full feeds that Lifehacker (et al) already offer: http://lifehacker.com/vip.xml (this article to it came up in a google search - http://lifehacker.com/5489210/)!

(also, that needs to be fixed asap, lest anyone get the wrong idea)

AdamGibbins · on March 1, 2011

Excellent thanks, shall be applying this to all my Gawker feeds.

yahelc · on March 2, 2011

You know they make a full-feed version available of all their sites, right? It's just of the form http://lifehacker.com/vip.xml

gnosis · on March 2, 2011

##sigh##

Yet another service that requires me to hand over information on what I read.

Why couldn't this be made as a privacy-respecting application I can run from my own machine?

roll · on March 1, 2011

interesting project. I was doing a similar thing with yahoo pipes, but it got blocked because of robots.txt. What do you do about it?

shadowpwner · on March 2, 2011

You can always disregard robots.txt.. ;)

palak55 · on March 6, 2011

The same service is offer by www.getrss.in and i am happy client of them for more then 8 months

palak55 · on March 6, 2011

http://www.getrss.in

timrosenblatt · on March 1, 2011

Nice. Grabs the whole article text so you don't have to leave the RSS reader.

timrosenblatt · on March 1, 2011

Lol: http://fulltextrssfeed.com/news.ycombinator.com/rss

Works well for keeping up with HN too :)

boctor · on March 1, 2011

Why do a couple of articles in that feed say either "Unable to parse this page for content" (daringfireball.net) or "unable to retrieve full-text content" (nytimes.com)?

timrosenblatt · on March 1, 2011

it's still just a weekend project, but if everyone keeps throwing love at it like this, it might be more than a weekend project. he's working on it now. :)

dholowiski · on March 1, 2011

The content thieves will love it too. This makes it much easier to automatically copy content.

getsat · on March 2, 2011

Not really. I can Right Click -> Copy XPath in Firebug's element inspector then just Nokogiri::HTML(page_source).xpath('/blah') to get at it. You can do it with the CSS selector as well. :)

Setting up a quick script to rip all the content from another site is trivial. There's also wget -m

geoffw8 · on March 2, 2011

I literally cannot get nokogiri set up on my Mac for love nor money, I'm a noob whose been trying for a week or two. Tried everything. Its preventing me from running tests. Damn xmllibs2.

getsat · on March 2, 2011

Install MacPorts: http://www.macports.org

Add /opt/local/bin to your PATH (bashrc or zshrc or whatever you use).

sudo port install libxml2 and sudo port install libxslt

Then sudo gem install nokogiri --no-rdoc --no-ri should run with no issues. That's all I had to do for the system ruby (1.8.7 on OSX 10.6) and 1.9.2 via rvm.

geoffw8 · on March 4, 2011

So it turned out it was webrat that was the problem, but your instructions actually fixed the problem! I'd researched for hours previously. Genuinely much appreciated!

getsat · on March 4, 2011

Glad I could be of assistance!

geoffw8 · on March 4, 2011

Thanks, just saw this. I'll give it a go now, much appreciated.

evilduck · on March 2, 2011

Do you have XCode installed? Nokogiri is a native extension, I don't think you can build them without the compiler tools installed.

geoffw8 · on March 4, 2011

Yes I do. I'll look into this, thanks!

roll · on March 2, 2011

I wonder how long will it take till sites like lifehacker and heise.de will start blocking this service...

adrianwaj · on March 2, 2011

Is there a time delay between the source feed and the full feed?

austintaylor · on March 2, 2011

Works great! Nice to have: concatenation of multi-page articles.

sankara · on March 2, 2011

Works with BBC. Thanks.

mariuskempe · on March 1, 2011

Thank you so much! :-)

timrosenblatt · on March 2, 2011

you're welcome!

andy_mason · on March 2, 2011

Just wanted to second the thanks. I read a lot more of authors content now.

irfn · on March 2, 2011

nice! how exactly does this generate the full text version?

iphoneedbot · on March 2, 2011

I love it! It works for tumblr rss -- I really wish though that it you can opensource it. (Well, I would just hate it if you start having hosting problems or other problems that would cause you the need to shut down)

Im currently using "Readable Feeds" Nirmal J. Patel (http://www.nirmalpatel.com/hacks/hnrss.html) and Andrew Trusty (http://andrewtrusty.com/2009/06/29/readable-feeds/)

I like it -- but its really inconsistent!

Cheers,