Article Notes

From Lessons from Writing a Compiler by Fernando Borretti


The prototypical compilers textbook is:

  1. 600 pages on parsing theory.

  2. Three pages of type-checking a first-order type system like C.

  3. Zero pages on storing and checking the correctness of declarations (the “symbol table”).

  4. Zero pages on the compilation model, and efficiently implementing separate compilation.

  5. 450 pages on optimization and code generation.

The standard academic literature is most useful for the extreme frontend (parsing) and the extreme backend (SSA, instruction selection and code generation), but the middle-end is ignored. This is fine if you want to learn how to build, e.g., the next LLVM: a fat backend with a very thin frontend.

Just Use Menhir You need a parser. This is the most mind-numbing part of the compiler. Writing a parser by hand is tricky and not worth the alleged performance increases. Parsing performance likely won’t be the bottleneck.

Parser combinators are very popular. It’s cute that you can parameterize grammars by making them into functions. But the problem with parser combinators is they implement parsing expression grammars. PEGs are popular too, but they’re too rigid.

In a PEG, the disjunction or choice operator (typically denoted by the pipe character |) is ordered choice. If you have a context-free grammar like foo:=a|b|c, you can expect the parser generator to be smart about which case to take given input.

In PEGs – and, therefore, in parser combinators – choices are tried in the order in which they are listed. This leads to huge headaches.

Things like left recursion lead to difficult to debug infinite loops, you will have to contort your grammar to fit the rules of PEG. The end result does not look like the cute parser combinator one-liners that draw everyone in.

Good old fashioned parser generators like Menhir or ANTLR don’t have this problem. Choice is unordered, like in the EBNF notation you find in language specifications. Left recursion is implemented “intelligently”, usually without requiring you to transform your grammar.

Menhir is a mature parser generator. It is integrated into OCaml’s build system: dune. The documentation is alright, but there’s enough practical tutorials to get you started.