callorico's comments

callorico · on June 16, 2010

To expand slightly on the above: I'm using BeautifulSoup to parse the text fragments out of the HTML doc. Then, for each fragment, a set of features is extracted. This is really the secret-sauce part of things. The features look at things like how many words are in the text fragment (ingredients tend to be fairly short). The features are fed into a classifier (I'm using the NLTK library for this), which outputs a yes/no label for whether or not the fragment is an ingredient.
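
A minimal sketch of that pipeline, not the actual implementation: it assumes the modern bs4 package and NLTK's NaiveBayesClassifier, and the feature names, unit list, and training examples are all made up for illustration. Only the word-count idea comes from the description above.

```python
from bs4 import BeautifulSoup
from nltk.classify import NaiveBayesClassifier

# Hypothetical unit vocabulary; a real system would need a much longer list.
UNITS = {"cup", "cups", "tsp", "tbsp", "teaspoon", "tablespoon", "oz", "lb"}


def text_fragments(html):
    """Return the non-empty text strings of an HTML document, in order."""
    soup = BeautifulSoup(html, "html.parser")
    return list(soup.stripped_strings)


def fragment_features(fragment):
    """Hand-crafted features for one fragment (illustrative, not the real set)."""
    words = fragment.lower().split()
    return {
        "short": len(words) <= 8,  # ingredients tend to be fairly short
        "starts_with_digit": fragment[:1].isdigit(),
        "has_unit": any(w.strip(".,") in UNITS for w in words),
    }


# Tiny made-up training set; real training data would be labeled recipe pages.
TRAINING = [
    ("2 cups all-purpose flour", "ingredient"),
    ("1 tsp salt", "ingredient"),
    ("3 tbsp olive oil", "ingredient"),
    ("1/2 cup sugar", "ingredient"),
    ("Preheat the oven to 350 degrees and grease a large baking pan", "other"),
    ("Share this recipe with your friends on your favorite social network", "other"),
    ("Stir the mixture until it is smooth and then set it aside", "other"),
    ("Copyright 2010 all rights reserved by the authors of this site", "other"),
]

classifier = NaiveBayesClassifier.train(
    [(fragment_features(text), label) for text, label in TRAINING]
)


def extract_ingredients(html):
    """Keep only the fragments the classifier labels as ingredients."""
    return [
        fragment
        for fragment in text_fragments(html)
        if classifier.classify(fragment_features(fragment)) == "ingredient"
    ]
```

Calling extract_ingredients(page_html) returns the fragments labeled as ingredients, in document order. Boolean features keep the sketch simple; word counts could also be bucketed into ranges.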

Let me know if you've got any NLP or machine learning experience. I'd love to bounce some ideas off of you.

bryanh · on June 17, 2010

Thanks for the info. I have little NLP experience, but I have been diving into machine learning and simple neural nets. What are your thoughts on NLTK? I've yet to use it.

callorico · on June 16, 2010

Thanks! The ingredient parsing is definitely not 100% accurate (it falls down completely on several sites). The ingredient screen that appears after submitting a URL was meant to be used for error correction: users can quickly delete and edit ingredients before adding them to their list. You are right, though; it does add an extra step before you can add another recipe URL.

I'm currently using a naive Bayes classifier for the ingredient extraction, with some hand-crafted features. There's also an augmenting step that tries to exploit ingredient locality in the recipe doc to pick out ingredients that might have been missed initially.
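
One way such a locality step could look, as a rough sketch only: the rule below (promote a short, unlabeled fragment that sits directly between two labeled ingredients) is an assumption for illustration, as are the function name and the max_words threshold. The comment above does not say how the real augmenting step works.

```python
def augment_by_locality(fragments, labels, max_words=8):
    """Second pass over classifier output, in document order.

    fragments: list of text fragments
    labels:    list of booleans, True where the classifier said "ingredient"
    Returns a new label list in which short fragments sandwiched between two
    labeled ingredients are promoted to ingredients.
    """
    augmented = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i]:
            continue  # already labeled as an ingredient
        # Read the ORIGINAL labels so one promotion cannot trigger another.
        between_ingredients = labels[i - 1] and labels[i + 1]
        is_short = len(fragments[i].split()) <= max_words
        if between_ingredients and is_short:
            augmented[i] = True
    return augmented


fragments = [
    "Ingredients",
    "2 cups flour",
    "a pinch of salt",  # no quantity or unit, so easy for a classifier to miss
    "1 tsp sugar",
    "Mix well and bake for twenty minutes until golden on top",
]
labels = [False, True, False, True, False]
print(augment_by_locality(fragments, labels))
```

Here "a pinch of salt" is picked up because both of its neighbors were labeled as ingredients, while the heading and the instruction sentence are left alone.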