
I hate that XHTML went away. HTML parsing is terrible


The story of XHTML is instructive to the field of software design. There are plenty of good resources on the web if you search for “why did XHTML fail?”

HTML parsing is at least deterministic and fully specified, whereas XHTML, as an XML language, leaves the handling of a number of syntax errors undefined and up to the parser.

  Conforming software may detect and report an error and may recover from it.
While fatal errors should cause all parsers to reject a document outright, this also leaves the end-user without any way to recover the information they care about. So XHTML leaves readers at a loss while failing to eliminate parsing ambiguity and undefined behavior.
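To make the fatal-error behavior concrete, here is a minimal sketch using Python's standard-library XML parser. It is a generic XML parser, not a browser's XHTML pipeline, and `try_parse` is a helper name made up for this example. The point is only that one mismatched tag means the caller gets no document at all.

```python
import xml.etree.ElementTree as ET

def try_parse(markup):
    """Return (True, root tag) on success, or (False, error message) on a fatal error."""
    try:
        return True, ET.fromstring(markup).tag
    except ET.ParseError as err:
        # Well-formedness errors are fatal: no partial tree is handed back.
        return False, str(err)

well_formed = "<p>Hello <b>world</b></p>"
broken = "<p>Hello <b>world</p>"   # <b> is never closed

print(try_parse(well_formed))
print(try_parse(broken))
```

An HTML parser given the second string would still produce a tree containing "Hello world", which is the recovery the reader actually wants.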

Interestingly, it’s possible to encode an invalid DOM with XHTML while it’s impossible to do so in HTML. That means that XML/XHTML has given up the possibility of invalid syntax (by acting like it doesn’t exist) for the sake of inviting invalid semantics.


Interesting perspective, it makes me miss XHTML wayyy less. I was under the impression that XHTML (XML) was better specified and had less weirdness. I know HTML is now better specified but some of the things inherited from HTML 4 and before make no sense to me (optional closing tags SOMETIMES, optional stuff everywhere).


HTML didn’t make sense to me until I realized it’s built on a state machine and its rules are based on what’s on the stack of open elements. For example, a number of tags trigger a rule to close open P elements or list items, and many end tags trigger a rule saying something like “close open elements until you’ve closed one with the same name as this tag.”
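A toy sketch of that idea, in case it helps. This is not the real tree-construction algorithm: the `CLOSES_OPEN_P` set is a small illustrative subset, the scope handling is heavily simplified, and `build` and its `(kind, name)` token format are names I made up for the example.

```python
# Toy model only: the real HTML tree builder has many more rules and scopes.
CLOSES_OPEN_P = {"p", "ul", "ol", "div", "h1", "h2", "table"}  # illustrative subset
LIST_CONTAINERS = {"ul", "ol"}  # an open list shields outer <li> elements

def build(tokens):
    """Return the push/pop operations a simplified tree builder performs."""
    stack = []  # the stack of open elements
    ops = []

    def pop_until(name):
        # "Close open elements until you've closed one with this name."
        while stack:
            popped = stack.pop()
            ops.append(f"pop {popped}")
            if popped == name:
                break

    for kind, name in tokens:
        if kind == "start":
            if name in CLOSES_OPEN_P and "p" in stack:
                pop_until("p")
            if name == "li":
                # Walk down from the top of the stack; stop at a list container.
                for node in reversed(stack):
                    if node == "li":
                        pop_until("li")
                        break
                    if node in LIST_CONTAINERS:
                        break
            stack.append(name)
            ops.append(f"push {name}")
        elif kind == "end" and name in stack:
            pop_until(name)
        # A stray end tag with no matching open element is ignored.
    return ops

# <p><p><ul><li><li></ul>, with no explicit </p> or </li> anywhere
tokens = [("start", "p"), ("start", "p"), ("start", "ul"),
          ("start", "li"), ("start", "li"), ("end", "ul")]
print(build(tokens))
```

Every "optional" end tag in the input shows up as an explicit pop in the output, which is why the rules feel arbitrary as string patterns but regular as stack operations.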

This, IMO, is a bigger reason to avoid regex and XML parsers for HTML documents. The rules aren’t apparent when thinking linearly about what strings appear after or before each other; they become clearer when thinking of HTML as a shorthand syntax for certain kinds of push and pop operations.

XHTML is easier to parse, but for well-formed documents it pushes the complexity of invalid markup onto the rendering side. For example, it’s well-formed to include a button inside a button, so XHTML browsers render exactly this, but it makes no sense from a UI perspective and strange things happen when invalid markup is sent in well-formed XML.
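A quick way to see the well-formedness half of this, sketched with Python's standard-library XML parser as a stand-in (it is not a browser, so it says nothing about rendering):

```python
import xml.etree.ElementTree as ET

# Nested buttons are invalid HTML content, but perfectly well-formed XML.
doc = "<div><button>outer <button>inner</button></button></div>"

root = ET.fromstring(doc)           # parses without complaint
inner = root.find("button/button")  # the inner button really is a child of the outer one
print(inner.text)
```

The parser has no opinion about whether the tree makes sense; whatever consumes the tree has to deal with a button nested inside a button.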



