The hardest part (IMHO) about getting a simple language running, particularly for beginners, is the “expression”: your typical algebraic-style infix expressions with precedence.
Just getting the grammar straight on the naturally recursive structures can be tricky, and expressions also touch a large portion of the code generation and runtime.
> The hardest part (IMHO) about getting a simple language running, particularly for beginners, is the “expression”
This is why teaching material should be tailored towards teaching rather than implementing demented stuff from the real world just because it is "there." You can easily implement a parser that ignores precedence and tell the reader that we are using brackets to force precedence instead.
In a real-world parser, you might use something like this[1] or some other home-grown algorithm if you didn't know about it. Doesn't matter as long as the parser works.
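For what it's worth, here is a minimal sketch (my own toy, not the linked algorithm) of the brackets-instead-of-precedence idea: the grammar is just `NUMBER | '(' expr OP expr ')'`, so there is no precedence for the parser to worry about at all.

```python
import re

def tokenize(src):
    # numbers, parens, and the four operators; everything else is ignored
    return re.findall(r"\d+|[()+\-*/]", src)

def parse(tokens):
    """Parse one expression: NUMBER | '(' expr OP expr ')'."""
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens)
        op = tokens.pop(0)
        right = parse(tokens)
        assert tokens.pop(0) == ")", "expected closing bracket"
        return (op, left, right)
    return int(tok)

def evaluate(node):
    if isinstance(node, int):
        return node
    op, l, r = node
    l, r = evaluate(l), evaluate(r)
    # integer division for simplicity
    return {"+": l + r, "-": l - r, "*": l * r, "/": l // r}[op]

print(evaluate(parse(tokenize("(1 + (2 * 3))"))))  # → 7
```

The reader writes the brackets, so the recursion never has to decide whether `*` binds tighter than `+`.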
Is that really the hard part? I figured that part out before I knew much programming: I made a simple Excel-style formula parser. (A horrible, horrible solution based around replacing strings and manually counting parentheses... but it worked!)
I never got much farther than that though, I've looked into making a compiler many times and got overwhelmed every time.
Good to know I wasn't alone in writing terrible expression parsers. Since I knew absolutely nothing about parsing, my parser/interpreter consisted of splitting on whitespace, followed by linear scanning to find the most high precedence operation, repeated until a single token remains (basically how a human would do it).
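For the curious, here is a guess at what that scheme looks like in code (purely my reconstruction of the description above, not the original): split on whitespace, linearly scan for the highest-precedence operator, reduce it, and repeat until one token remains.

```python
PREC = {"*": 2, "/": 2, "+": 1, "-": 1}

def evaluate(expr):
    tokens = expr.split()  # "splitting on whitespace": operands must be spaced out
    while len(tokens) > 1:
        # linear scan: operators sit at the odd indices; pick the strongest one
        best = max(range(1, len(tokens), 2), key=lambda i: PREC[tokens[i]])
        a, op, b = float(tokens[best - 1]), tokens[best], float(tokens[best + 1])
        result = {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[op]
        tokens[best - 1:best + 2] = [str(result)]  # reduce the triple in place
    return float(tokens[0])

print(evaluate("1 + 2 * 3"))  # → 7.0
```

Quadratic and fragile, but it really is "how a human would do it", and it gets precedence right.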
I think the hardest part is function calls. Implementing the standard calling convention is really fiddly and annoying when you have to save and restore registers around function calls if you want to do it efficiently (keeping track of which registers are dirty, doing parallel loads of registers into the right argument registers, spilling when appropriate, etc.). It is fiddly.
The alternative is going to an IR and doing full register allocation but then you need to implement Lengauer-Tarjan to get into SSA, all the same parallel load stuff for phi functions/block arguments, out-of-SSA, reconstruct in optimisation passes all the information you discard by going to a linear IR, etc.
I'd argue that Lengauer-Tarjan is overrated. While it is neat, it's premature optimization IMHO. Fully maximal SSA (at the entry of each block insert a phi-node for each variable) + dead code elimination (we need it anyways) is much simpler.
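To illustrate what I mean (a toy sketch with an invented CFG representation, not production code): phi insertion for maximal SSA needs nothing but each block's predecessor list.

```python
def insert_maximal_phis(cfg, variables):
    """cfg: dict block_name -> list of predecessor block names."""
    phis = {}
    for block, preds in cfg.items():
        if preds:
            # one phi per variable, with one incoming value per predecessor
            phis[block] = [(v, [f"{v}@{p}" for p in preds]) for v in variables]
        else:
            phis[block] = []  # entry block has no predecessors, so no phis
    return phis

cfg = {"entry": [], "loop": ["entry", "loop"], "exit": ["loop"]}
phis = insert_maximal_phis(cfg, ["x", "y"])
# every non-entry block gets a phi for every variable, most of them
# redundant; the later DCE pass is expected to delete the unused ones
print(sum(len(p) for p in phis.values()))  # → 4
```

No dominance, no liveness, no graph traversal: the whole thing is local to each block, which is the appeal.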
Typical implementations of Lengauer-Tarjan are often taken verbatim from Andrew Appel's book and involve higher constant factors than alternative algorithms - such as Cooper et al's "engineered" version of the usual fixpoint algorithm, which is very neat, simple, and performs well.
If you are saying that constructing SSA following the classical approach is a premature optimisation, perhaps you would prefer Braun et al's retelling of Cliff Click's SSA construction algorithm - which works backwards from uses to place phis and requires no dominance information to be computed.
Why not fully maximal SSA + DCE? I mean, other than timings consideration.
Fully maximal does not even need any graph traversal, it is entirely local to each block.
I think it will introduce too many redundant phis, but I've never used it in practice - so I can only speculate. I'm not convinced DCE will clean maximal SSA up substantially. Even the classic Cytron et al algorithm must be combined with liveness analysis to avoid placing dead phis (that is, it does not produce pruned SSA by default). In the past, there was always a fear that SSA has the potential to cause quadratic blowup in the number of variables - this concern is mostly theoretical but probably influenced some of the design decisions around algorithms for constructing SSA.
Braun et al's algorithm works backwards from uses (which generate liveness), so you get pruned SSA out. In the case of reducible control flow graphs, you also get minimal SSA. This is all without any liveness or dominance computation beforehand. Granted, you may want those things later, but it's nice that you can construct a decent quality SSA with a fairly intuitive algorithm. Also shown in the paper is that you can incorporate a few light optimisations during SSA construction (constant folding, for example).
I'll definitely test maximal SSA + DCE in tinyoptimizer when I get to it. For the moment I made a somewhat-pruned mem2reg pass [1]; not even sure what to call it. But it did require computing dominance frontiers.
I'm always happy to see more accessible resources for compiler writers.
---
As an aside: for displaying CFGs on the page, it would be very interesting to emit something somewhat dynamic. SVGs are always a good start, but there is a neat library for doing hierarchical graph layout (dagre, with dagre-d3 handling rendering as well). In my own efforts at pedagogy, I've been interested in producing CFGs in-browser whose basic blocks comprise a "unified diff" view of the block (this being achieved in real time by maintaining a subset of LLVM whose CFGs are persistent). Then it is more obvious what has changed: at least in the case of mem2reg, which shouldn't introduce new blocks or move too much around (I forget if it hoists allocas to the entry block or not).
It'd also be cool to distil what underlying ideas you have found to be most useful in your efforts. The constrained scope of them may be useful to me, as I've wanted to create a kind of "advent of compilers" for years (advent of code but with a heavy slant towards compiler tasks/algorithms).
Well, in my personal experience, parsing is the easiest part. It's what you do with the AST once you have one that is the hard part. But for some reason an overwhelming amount of writing is about frontends, when you could teach the concepts to a layman with some boxes and arrows.
If all you do is construct an AST and interpret it, it's not really a compiler, is it? It doesn't compile from source to target. At most you're describing a compiler front end (arguably), or an interpreter.
I would expect:
lexer + parser + AST definition and construction + semantic analysis/type checking + X + codegen to ASM/WASM/C
where X includes
- definition of an intermediate representation (IR)
- lowering AST to IR
- static analysis
- improvers/optimisation passes (at least some simple stuff)
- code gen including register allocation
This can still be a simple "complicated" program, but there's more to a compiler than an AST interpreter.
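To make that concrete, here is a compressed toy version of part of that pipeline, for arithmetic expressions only (semantic analysis and optimisation passes are elided, and all the names and the IR/asm formats are invented for illustration):

```python
import re
from itertools import count

def lex(src):
    return re.findall(r"\d+|[+*()]", src)

def parse(toks):
    # recursive descent with two precedence levels: '+' below '*'
    def atom():
        t = toks.pop(0)
        if t == "(":
            n = add(); toks.pop(0)  # consume ')'
            return n
        return ("num", int(t))
    def mul():
        n = atom()
        while toks and toks[0] == "*":
            toks.pop(0); n = ("*", n, atom())
        return n
    def add():
        n = mul()
        while toks and toks[0] == "+":
            toks.pop(0); n = ("+", n, mul())
        return n
    return add()

def lower(ast):
    # lowering: flatten the AST into a linear three-address IR
    ir, fresh = [], count(1)
    def go(node):
        if node[0] == "num":
            t = f"t{next(fresh)}"; ir.append((t, "li", node[1])); return t
        a, b = go(node[1]), go(node[2])
        t = f"t{next(fresh)}"; ir.append((t, node[0], a, b)); return t
    go(ast)
    return ir

def codegen(ir):
    # trivial "register allocation": every temp keeps its own register
    return [" ".join(map(str, instr)) for instr in ir]

for line in codegen(lower(parse(lex("1 + 2 * 3")))):
    print(line)
# prints:
# t1 li 1
# t2 li 2
# t3 li 3
# t4 * t2 t3
# t5 + t1 t4
```

Even at toy scale the shape is the point: each stage is a total function from one representation to the next, which is what distinguishes this from a treewalk interpreter.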
> If all you do is construct an AST and interpret it it's not really a compiler is it.
What if I dumped the AST to disk, called it "bytecode" and updated the "interpreter" to run this "bytecode"? Would you consider that to be a compiler? :-)
> where X includes
I don't consider any of these things to be essential. There are entire languages out there which run on interpreters of some kind. There are others which compile down to C/JS. Many compilers use LLVM as their backend. By your definition, none of them use (or are) compilers.
:) No. Under my definition there needs to be some non-trivial transformation into or out of an intermediate and/or target representation (either syntax-directed directly out of the parser, or from a materialized AST). Personally I would argue that even if you trivially "compiled" to sequential bytecode, if the bytecode is then interpreted it is hard to argue that you have created a compiler (the pre-JIT CPython interpreter is still an interpreter, even though it includes a bytecode representation). But I can see that this point can be argued, and historically a school exercise of writing a Pascal-to-pcode converter would be called a compiler, so sure, you can take that perspective if you like. Just don't get confused about whether you are learning to build a compiler (definition 1) or a compiler (definition 2).
I have written all kinds of compilers, interpreters and vms over the last 15 years. A couple of them for work. The rest for fun. Some vms used RC, others did mark-and-sweep GC. Some vms even did JIT. A lot of them were simple treewalk interpreters because the point was to play with the syntax of the language.
Based on that experience, I would say your definition is more academic than real-world, as javac (or any compiler targeting the JVM) is not a real compiler according to it.
> javac (or any compiler targeting the JVM) is not a real compiler according to it
I think it depends on how you interpret my "non-trivial transformation into or out of an intermediate and/or target representation". For example if javac involves IR, SSA based analysis and transformation (which I assume it does) then that would be a non-trivial transformation and I'd call it a compiler. If on the other hand it was a direct syntax-directed transformation to java byte code then I'd call it a translator not a compiler. But I'm not claiming any authority over definitions here. Words aren't my strong suit.
I am fairly certain that it does not. The JVM is a stack-machine.
I wrote a compiler for a simple programming language that compiled down to JVM bytecode for a project in the early 2010s. Its output for test programs was as fast as the comparative Java ones because they were fairly similar at the bytecode level.
The HotSpot JVM is where all the optimization effort is applied.
I believe there’s some trivial optimization done by the frontend (constant folding, simplifying string concatenation, converting switch statements to lookup tables) but I think they’d agree with you that javac doesn’t really fit their definition of a compiler and that the HotSpot JIT is the true compiler based on how they talked about the Python interpreter pre-JIT.
I strongly disagree with your definition. A naive syntax-directed translation to bytecode or machine code is still compilation. Lots of production compilers do exactly that, e.g. Wirthian compilers.
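For illustration, a sketch of what such a syntax-directed translation looks like (opcode names are made up): the parser emits stack-machine bytecode directly as it recognises the grammar, with no AST or IR in between.

```python
import re

def compile_expr(src):
    toks = re.findall(r"\d+|[+*()]", src)
    code = []
    def factor():
        t = toks.pop(0)
        if t == "(":
            expr(); toks.pop(0)  # consume ')'
        else:
            code.append(("PUSH", int(t)))
    def term():
        factor()
        while toks and toks[0] == "*":
            toks.pop(0); factor(); code.append(("MUL",))
    def expr():
        term()
        while toks and toks[0] == "+":
            toks.pop(0); term(); code.append(("ADD",))
    expr()
    return code

print(compile_expr("1 + 2 * 3"))
# → [('PUSH', 1), ('PUSH', 2), ('PUSH', 3), ('MUL',), ('ADD',)]
```

Each grammar rule emits its operands first and its operator last, so postfix bytecode falls out of the recursion for free; this single-pass structure is the essence of the Wirth-style approach.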
javac does very, very few things. It intentionally emits byte code close to the source program. The HotSpot compiler in the JVM does the heavy lifting.
Honestly, ANTLR made this pretty straightforward for me. I didn't want to work in Java, but they have a ton of targets. You can definitely write it all yourself, and that's a great learning exercise. But I wanted to get a parser for a language idea I had in mind, and it took a couple of days with ANTLR (https://github.com/chicory-lang/chicory)
If you know basic programming (JS/Python), a day is more than enough to get the concept across using a tiny language. Could probably be done in an hour.
The problem, always, is people are given terrible reading recommendations which makes the subject more complicated than it is.
I never took a compilers course. The first reading recommendation I got was The Dragon Book. Perhaps not the worst starting place at the time (mid 90s) but my memory of it was that the book was front-loaded with automata and parser theory. At the time I didn't really make it past the front end. It would be interesting to see a compiler tutorial that started at codegen and worked backwards (or started at the ends: lexing and codegen, and worked towards the middle in alternating chapters).
The Dragon Book is terrible as an introduction. There are better books I would probably recommend, but not to a beginner. The best "complete" intro out there today is Nystrom's Crafting Interpreters.[1]
> There are better books I would probably recommend
I'm curious what you'd recommend.
For what it's worth my goal was to compile to machine code. Anything less would have seemed insufficient. Later I got Appel's "Modern Compiler Implementation in Java" and Allen and Kennedy "Optimizing Compilers for Modern Architectures".
Cooper and Torczon "Engineering a Compiler" was recommended here recently. I haven't seen it.
I don't think that the Ghuloum 2006 paper was the initial basis for nanopass;
at least the paper "A Nanopass Framework for Compiler Education" seems to predate it, in 2004:
https://dl.acm.org/doi/10.1145/1016848.1016878
I got started from the books that were available here in the early 2000s. The one I like the most is "Programming Language Pragmatics" by Michael Scott, mainly for its simplicity. Then there is "Advanced Compiler Design and Implementation" by Steven Muchnick. I have Allen-Kennedy's book too, but I think it went over my head when I went through it. So I kept it aside.
"Engineering a Compiler" is quite approachable.
But, once you know the basics, you get the best bang-for-the-buck by looking at source code of various compilers and virtual machines.
> It would be interesting to see a compiler tutorial that started at codegen and worked backwards (or started at the ends: lexing and codegen, and worked towards the middle in alternating chapters)
This is exactly how my university's compiler course did it[1]. It was really nice to start with assembly and go upwards. We had all the lexing/parsing stuff discussed around the middle of the semester, rather than being saddled with theory-heavy regexes, automata theory, etc. right at the beginning.
Production compilers are complicated because of the feature set of languages and performance requirements.
You can write a lexer + parser + treewalk interpreter for a simple language in a day if you know what you are doing.
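The treewalk part really is small. Here is a sketch of an evaluator for a toy language with variables, assignment, and a while loop (the AST is built by hand here to skip the lexer/parser; the tuple encoding is my own invention):

```python
def run(stmts, env):
    # execute a list of statements against a mutable environment
    for s in stmts:
        if s[0] == "set":
            env[s[1]] = ev(s[2], env)
        elif s[0] == "while":
            while ev(s[1], env):
                run(s[2], env)

def ev(e, env):
    # evaluate an expression node: literal, variable, or binary op
    if isinstance(e, int):
        return e
    if isinstance(e, str):
        return env[e]
    op, a, b = e
    a, b = ev(a, env), ev(b, env)
    return {"+": a + b, "*": a * b, "<": a < b}[op]

env = {}
# n = 5; r = 1; i = 1; while i < n + 1: r = r * i; i = i + 1
run([("set", "n", 5), ("set", "r", 1), ("set", "i", 1),
     ("while", ("<", "i", ("+", "n", 1)),
      [("set", "r", ("*", "r", "i")), ("set", "i", ("+", "i", 1))])], env)
print(env["r"])  # → 120
```

Everything else (error messages, scoping, functions) is where the real time goes, but the core dispatch-on-node-type loop fits in a page.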