This may be unrelated, but I also noticed that modifying HTML with lxml (for instance, rewriting every URL in an HTML document to point to a different domain) produced malformed HTML with duplicated tags. So I would recommend using lxml only as a data extraction tool.
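To illustrate what "extraction only" could look like, here is a minimal sketch that reads the links out of a document with lxml and never serializes the tree back to HTML. The helper name `extract_links` and the sample markup are made up for illustration; `lxml.html.fromstring` and `iterlinks()` are the lxml calls being relied on.

```python
from lxml import html


def extract_links(markup: str) -> list[str]:
    """Return every link found in the markup, without modifying the tree."""
    doc = html.fromstring(markup)
    # iterlinks() yields (element, attribute, link, pos) tuples for each
    # link-carrying attribute (href, src, ...); only the link is kept here.
    return [link for _element, _attribute, link, _pos in doc.iterlinks()]


# Hypothetical sample document, used only for this example.
SAMPLE = (
    "<html><body>"
    '<a href="http://old.example.com/a">A</a>'
    '<img src="http://old.example.com/i.png">'
    "</body></html>"
)

links = extract_links(SAMPLE)
print(links)
```

The parsed tree is used read-only here, so the original markup is left untouched and nothing depends on how lxml would write the document back out.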