Right, I may have forgotten to mention that lexerless parsers are somewhat unusual.
I didn't have much time in the talk to go into the reasons, so here they are:
- You'll need a more complex lexer to parse a shell-like syntax. For example, one common thing a lexer does is discard whitespace, but shell syntax is whitespace-sensitive: "a$x" and "a $x" (double quotes not part of the code) are different things. The first is a single word containing a string concatenation; the second is two separate words. (There's a small sketch of this after the list.)
- If your parser backtracks a lot, lexing can improve performance: you're backtracking over tokens rather than characters, and there are far fewer tokens than characters. Elvish's parser doesn't backtrack. (It does use lookahead fairly liberally.)
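To make the whitespace point concrete, here's a minimal sketch of what a lexerless word parser might look like. This is not Elvish's actual code; the Piece/parseWords names are made up for illustration. It scans characters directly and groups adjacent pieces into words, so "a$x" comes out as one word with two pieces and "a $x" as two words:

```go
package main

import (
	"fmt"
	"unicode"
)

// Piece is one segment of a word: either a literal or a $variable reference.
type Piece struct {
	Kind string // "literal" or "variable"
	Text string
}

// parseWords scans the source text directly (no separate token stream) and
// groups adjacent pieces into words; whitespace ends the current word.
func parseWords(src string) [][]Piece {
	var words [][]Piece
	var cur []Piece
	flush := func() {
		if len(cur) > 0 {
			words = append(words, cur)
			cur = nil
		}
	}
	i := 0
	for i < len(src) {
		switch {
		case unicode.IsSpace(rune(src[i])):
			// Whitespace separates words, so "a $x" becomes two words.
			flush()
			i++
		case src[i] == '$':
			// A variable reference: consume the name after '$'.
			j := i + 1
			for j < len(src) && !unicode.IsSpace(rune(src[j])) && src[j] != '$' {
				j++
			}
			cur = append(cur, Piece{"variable", src[i+1 : j]})
			i = j
		default:
			// A literal run up to the next '$' or whitespace.
			j := i
			for j < len(src) && !unicode.IsSpace(rune(src[j])) && src[j] != '$' {
				j++
			}
			cur = append(cur, Piece{"literal", src[i:j]})
			i = j
		}
	}
	flush()
	return words
}

func main() {
	for _, src := range []string{"a$x", "a $x"} {
		words := parseWords(src)
		fmt.Printf("%q -> %d word(s): %v\n", src, len(words), words)
	}
}
```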
Having a lexerless parser does mean that you have to deal with whitespace everywhere, which can get a bit annoying. But personally I like the conceptual simplicity and not having to deal with silly tokens like LBRACE, LPAREN, PIPE.
I have not used parser generators enough to comment about the benefits of using them compared to writing a parser by hand. The handwritten one works well so far :)
That example you gave could certainly be done in Lex/Flex, and I assume in other lexers/tokenizers as well. For instance, you would probably use states (start conditions) and have "$x" in the initial state produce a different token type than "$x" in the string state.
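Roughly, that state-based approach could look like the following hand-written sketch (in Go rather than as a Flex spec, since that's the language of the talk; the token names and helpers are made up for illustration). The lexer tracks whether it is inside a double-quoted string and emits a different token kind for $x depending on that state, which is what start conditions would buy you in Flex:

```go
package main

import "fmt"

type TokenKind int

const (
	Bareword       TokenKind = iota // plain word outside strings
	VarSubst                        // $x at the top level
	StringLit                       // literal text inside "..."
	StringVarSubst                  // $x inside "..."
)

type Token struct {
	Kind TokenKind
	Text string
}

// lex keeps one piece of state (are we inside a double-quoted string?) and
// emits a different token kind for $x depending on that state.
func lex(src string) []Token {
	var toks []Token
	inString := false
	i := 0
	for i < len(src) {
		switch {
		case src[i] == '"':
			inString = !inString
			i++
		case src[i] == '$':
			// Consume the variable name after '$'.
			j := i + 1
			for j < len(src) && isNameChar(src[j]) {
				j++
			}
			kind := VarSubst
			if inString {
				kind = StringVarSubst
			}
			toks = append(toks, Token{kind, src[i:j]})
			i = j
		case src[i] == ' ' && !inString:
			i++ // whitespace separates tokens outside strings
		default:
			// A run of ordinary characters; its kind depends on the state.
			j := i
			for j < len(src) && src[j] != '$' && src[j] != '"' && !(src[j] == ' ' && !inString) {
				j++
			}
			kind := Bareword
			if inString {
				kind = StringLit
			}
			toks = append(toks, Token{kind, src[i:j]})
			i = j
		}
	}
	return toks
}

func isNameChar(c byte) bool {
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= '0' && c <= '9' || c == '_'
}

func main() {
	fmt.Printf("%v\n", lex(`echo $x "hello $x"`))
}
```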
But I do get your meaning; I've written a lot of tokenizers by hand as well, and sometimes the hand-written code is easier to follow. Config files for grammars can get convoluted fast.
Again, I didn't mean it as criticism. But your talk title does start with "How to write a programming language and shell in Go", so given the title I think lexers/tokenizers are worth noting.
Yeah, ultimately there's an element of personal taste at play.
The authoritative tone of "how to write ..." is meant in jest, but obviously by doing that I risk being misunderstood. A more accurate title would be "how I wrote ...", but that's slightly boring, and I was trying hard to get my talk proposal accepted, you see :)