lxml is also known to have memory leaks [0][1], so be careful using it in any kind of automated system that parses lots of small documents. I ran into this issue personally, and it actually caused me to abandon a project until months later, when I found the references linked above. For one-off tasks, though, it works nicely and is fast.
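If you do have to run it in a long-lived pipeline, one common workaround is to bound the damage by recycling worker processes, so whatever a process leaks is reclaimed by the OS when it exits. A minimal sketch of that approach (the document list and extract_title helper here are made up for illustration):

    from multiprocessing import Pool

    from lxml import html

    def extract_title(raw: bytes) -> str:
        # Parse one small document and pull out a single field.
        doc = html.fromstring(raw)
        return doc.findtext(".//title") or ""

    if __name__ == "__main__":
        # Stand-in for a real stream of small documents.
        docs = [b"<html><head><title>doc</title></head></html>"] * 1000
        # maxtasksperchild replaces each worker after N parses, so any
        # memory the parser leaks is returned to the OS when the worker
        # process exits, instead of accumulating forever.
        with Pool(processes=4, maxtasksperchild=50) as pool:
            titles = pool.map(extract_title, docs)
        print(len(titles), "documents parsed")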
Also, a question: how often do you really encounter badly-formed markup in the wild? How hard is it really to get HTML right? It seems pretty simple: just close your tags and don't embed too much crazy stuff in CDATA. Yet I often read about how HTML parsers must be "permissive" while XML parsers don't need to be. I've never had a problem parsing bad markup; usually my issues have to do with text encoding (either text that was mangled directly, or correctly encoded vestiges of a prior mangling) and the other usual problems associated with text data.
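For what it's worth, the permissive/strict split shows up immediately if you feed the same broken snippet to lxml's XML parser and its HTML parser (a small sketch; the snippet is contrived):

    from lxml import etree, html

    broken = "<p>unclosed paragraph<p>another<br>"

    # The strict XML parser rejects the malformed input outright.
    try:
        etree.fromstring(broken)
    except etree.XMLSyntaxError as err:
        print("XML parser refused:", err)

    # The permissive HTML parser repairs it into a usable tree.
    doc = html.fromstring(broken)
    print(etree.tostring(doc, pretty_print=True).decode())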
[0]: https://benbernardblog.com/tracking-down-a-freaky-python-mem...
[1]: https://stackoverflow.com/q/5260261