How parsers and compilers work

Download this zip file to obtain the source code of files discussed in this article.

A few months ago I began a personal project to learn a bit more about how parsers and compilers work.

Here are some notes that I made during that project. Maybe they will be of use to you.

I'd like to learn how to write parsers and compilers. If I could write my own parsers and compilers, I could do things like:

To learn, I have decided to embark on a project to write a parser and a compiler in Python. Python is a powerful, high-level, object-oriented language that is also very readable. It would allow me to work with basic concepts without getting bogged down in language mechanics.

I had been studying the subject for some time and I had learned some basic concepts. But I was totally confused by discussions that conflated the scanner, lexer, and parser; some even used the terms "scanner", "lexer", and "parser" interchangeably. Then I discovered How does an interpreter/compiler work? and became enlightened. It clearly laid out the different functions of the scanner, lexer, and parser. Here are the lines that triggered my awakening:

Source File —> Scanner —> Lexer —> Parser —> Interpreter/Code Generator

Scanner: This is the first module in a compiler or interpreter. Its job is to read the source file one character at a time. It can also keep track of which line number and character is currently being read. .... For now, assume that each time the scanner is called, it returns the next character in the file.

Lexer: This module serves to break up the source file into chunks (called tokens). It calls the scanner to get characters one at a time and organizes them into tokens and token types. Thus, the lexer calls the scanner to pass it one character at a time and groups them together and identifies them as tokens for the language parser (which is the next stage).

Parser: This is the part of the compiler that really understands the syntax of the language. It calls the lexer to get tokens and processes the tokens per the syntax of the language.

Every compiler is written to process source files in a particular language. A COBOL compiler compiles COBOL code; it doesn't compile Fortran. A Python interpreter parses and executes Python, not Ruby.

So the place to start in compiler development is with the language that I want to compile. In particular, I need some mechanism for creating a precise, formal specification of the language that I want to compile. That mechanism is EBNF. So the first step is to use EBNF to specify the language that I want to process.

But to begin with, I don't really need to know much about the particular language to write a scanner. So I will defer thinking about EBNF and language specification until I start to build the lexer.

The source text contains the text of a program written in the specified language.

The scanner's job is to read the source file one character at a time. For each character, it keeps track of the line and character position where the character was found. Each time the scanner is called, it reads the next character from the file and returns it.

So let's write a scanner.

Since we're all object-oriented now, we write two classes.

The first class is a Character class that will wrap a single character that the scanner retrieves from the source text. In addition to holding the character itself (its cargo) it will hold information about the location of the character in the source text. [SourceCode]

The second class is a Scanner class. When we instantiate this class, we will pass the constructor a string containing the source text. The result will be a scanner object that is ready to scan that particular string of source text. [SourceCode]

Having created the scanner machinery, let's put it through its paces.

I create a driver program that sets up a string of source text, creates a scanner object to scan that source text, and then displays the characters that it gets back from the scanner. [SourceCode]
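As a rough illustration, here is a minimal sketch of what a Character class, a Scanner class, and a small driver might look like. The names ENDMARK, line_index, col_index, source_index, and get() are assumptions made for this sketch; they are not necessarily the names used in the original source files.

```python
ENDMARK = "\0"  # assumed sentinel returned once the source text is exhausted


class Character:
    """One character of source text (its cargo) plus where it was found."""

    def __init__(self, cargo, line_index, col_index, source_index):
        self.cargo = cargo                # the character itself
        self.line_index = line_index      # 0-based line number
        self.col_index = col_index        # 0-based column within the line
        self.source_index = source_index  # 0-based offset into the whole text

    def __str__(self):
        # Give invisible characters a readable name in driver output.
        names = {" ": "SPACE", "\n": "NEWLINE", "\t": "TAB", ENDMARK: "EOF"}
        shown = names.get(self.cargo, self.cargo)
        return f"{self.line_index:>4} {self.col_index:>3}  {shown}"


class Scanner:
    """Hands out the characters of source_text one at a time."""

    def __init__(self, source_text):
        self.source_text = source_text
        self.source_index = 0
        self.line_index = 0
        self.col_index = 0

    def get(self):
        """Return the next Character, or an ENDMARK Character at the end."""
        if self.source_index >= len(self.source_text):
            return Character(ENDMARK, self.line_index, self.col_index,
                             self.source_index)
        cargo = self.source_text[self.source_index]
        char = Character(cargo, self.line_index, self.col_index,
                         self.source_index)
        # Advance the position counters for the next call.
        self.source_index += 1
        if cargo == "\n":
            self.line_index += 1
            self.col_index = 0
        else:
            self.col_index += 1
        return char


if __name__ == "__main__":
    # Driver: print every character until the end marker comes back.
    scanner = Scanner("a = 1;\nb = 2;")
    while True:
        char = scanner.get()
        print(char)
        if char.cargo == ENDMARK:
            break
```

Returning a sentinel character at the end of the text, rather than raising an exception, is one possible design choice; it lets the caller treat "end of file" like any other character.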

When we run the scanner driver program, it produces the following results.

A lexical analyser is also called a lexer or a tokenizer.

The lexer's job is to group the characters of the source file into chunks called tokens. (If the source text was written in a natural language (English, Spanish, French, etc.) the tokens would correspond to the words and punctuation marks in the text.) Each time the lexer is called, it calls the scanner (perhaps several times) to get as many characters as it needs in order to assemble the characters into a token. It determines the type of token that it has found (a string, a number, an identifier, a comment, etc.) and returns the token.

A scanner can be pretty much language-agnostic, but a lexer needs to have a precise specification for the language that it must tokenize. Suppose we want to process a language called nxx. Then the lexer needs to know the answers to questions like these about nxx:

Most people think of whitespace as being any string that contains only instances of the space character, the tab character, and the NL (newline) character (or the carriage return + linefeed characters (CRLF) that Windows uses to indicate a newline).

For a lexer, a whitespace character is: a character whose sole purpose is to delimit tokens. In a COBOL statement like this, for example:

MOVE STATE-ID TO HOLD-STATE-ID.

the spaces aren't significant in themselves, but without them the COBOL compiler would see:

MOVESTATE-IDTOHOLD-STATE-ID.

and wouldn't know what to make of it.

Whitespace characters are used by the lexer to detect the end of tokens, but generally are not passed on to the parser. Here is (a rough approximation of) the list of the tokens that the lexer would pass to the COBOL parser. Note that the list does not include any whitespace tokens.

In many languages, the whitespace characters consist of the usual suspects: SPACE, TAB, NEWLINE. But consider a language in which each statement must exist on its own line. For such a language the NEWLINE character is not whitespace at all but a token indicating the end of a statement in the same way that a semi-colon does in Java and PL/I. Another example: Python uses indentation (tabs or spaces) rather than keywords or symbols (e.g. "do..end" or "{..}") to control scope. So for Python (in at least some contexts) spaces and tabs are not whitespace characters.

Another case in which the lexer would pass whitespace tokens back to its caller is if the calling module is making some modifications to the input text (for example, removing comments from the source code) but otherwise leaving the source text intact, whitespace and all.
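To make the point about language-dependent whitespace concrete, here is a small sketch of a splitter whose notion of whitespace is a parameter. The function name tokenize, the constant SPACING, and the token names SPACE, TAB, and NEWLINE are all hypothetical, chosen only for this illustration.

```python
SPACING = " \t\n"  # characters that may or may not count as whitespace
NAMES = {" ": "SPACE", "\t": "TAB", "\n": "NEWLINE"}


def tokenize(text, whitespace):
    """Split text into tokens.

    Every spacing character ends the token being built. A spacing character
    that is in `whitespace` is then discarded; one that is not is emitted
    as a token of its own.
    """
    tokens = []
    current = ""
    for ch in text:
        if ch in SPACING:
            if current:
                tokens.append(current)
                current = ""
            if ch not in whitespace:
                tokens.append(NAMES[ch])
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


source = "x = 1\ny = 2"

# A language where newlines are ordinary whitespace:
print(tokenize(source, whitespace=" \t\n"))
# ['x', '=', '1', 'y', '=', '2']

# A line-oriented language where NEWLINE ends a statement:
print(tokenize(source, whitespace=" \t"))
# ['x', '=', '1', 'NEWLINE', 'y', '=', '2']
```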

For the lexer, we will need some tokenizing rules.

The standard way to specify rules for recognizing tokens is via a finite state machine (FSM), usually a deterministic finite automaton (DFA). So explanations of how to build a tokenizer often begin this way: Build an FSM that.... Or this way: Write a regular expression to match....

But in many cases, you don't need the power of a full FSM — the job can be done in a much simpler way. For example, most programming languages have simple rules for identifier tokens that can be expressed this way:

So let's write a lexer.

First, we invent a simple language called nxx. We won't bother to define nxx very carefully -- we won't need to do that in order to write a simple lexer that can process identifiers and whitespace.

We will need to define at least the rudiments of our language. That means specifying the symbols: the assignment symbol, grouping symbols such as parentheses and brackets, symbols for mathematical operations (plus, minus, etc.), and so on. We also need to define the rules for tokenizing comments, string literals (strings), numeric literals (numbers), and identifiers. For example, for nxx we say that a string can be contained in either single or double quotes, an identifier can contain letters, numeric digits, and underscores, and must start with a letter, and so on. I defined these in a file called nxxSymbols.py.
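The identifier rule just stated (letters, digits, and underscores, starting with a letter) can be checked with two character sets and a loop, with no explicit state machine. The sketch below is only an illustration: IDENTIFIER_STARTCHARS, IDENTIFIER_CHARS, and read_identifier are hypothetical names and are not taken from nxxSymbols.py.

```python
import string

# Assumed character sets for the nxx identifier rule described above.
IDENTIFIER_STARTCHARS = string.ascii_letters
IDENTIFIER_CHARS = string.ascii_letters + string.digits + "_"


def read_identifier(text, start):
    """Return the identifier beginning at text[start], or "" if none does."""
    if start >= len(text) or text[start] not in IDENTIFIER_STARTCHARS:
        return ""
    end = start + 1
    # Keep consuming characters for as long as they may appear in an identifier.
    while end < len(text) and text[end] in IDENTIFIER_CHARS:
        end += 1
    return text[start:end]


print(read_identifier("count_2 = 10", 0))   # count_2
print(repr(read_identifier("2abc", 0)))     # '' (must start with a letter)
```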

Now we write two classes.

The first class is a Token class that will wrap its cargo -- a string of characters that is the text of the token. In addition to holding its cargo, the token will hold information about the location of the token (actually, the location of its first character) in the source text. [SourceCode]

The second class is a Lexer class. When we instantiate this class, we will pass the constructor a string containing the source text. The lexer object will create a scanner object and pass the source text to it. Then the lexer will get characters from the scanner, which will get them from the source text. The result will be a lexer object that is ready to return the tokens in the source text. [SourceCode]
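Here is a minimal sketch of how a Token class and a Lexer class along these lines might fit together. To keep the example self-contained, a small generator function named scan stands in for the scanner object, and only identifiers, whole numbers, and one-character symbols are recognized (no strings or comments). All names here, including the token types IDENTIFIER, NUMBER, SYMBOL, and EOF, are assumptions for illustration and not the original nxx code.

```python
import string

IDENT_START = string.ascii_letters
IDENT_CHARS = string.ascii_letters + string.digits + "_"
WHITESPACE = " \t\n"
EOF = "\0"  # assumed end-of-text sentinel


def scan(source_text):
    """Stand-in scanner: yield (char, line, col) triples, then EOF forever."""
    line = col = 0
    for ch in source_text:
        yield ch, line, col
        if ch == "\n":
            line, col = line + 1, 0
        else:
            col += 1
    while True:
        yield EOF, line, col


class Token:
    """A chunk of source text (its cargo), its type, and its start position."""

    def __init__(self, cargo, token_type, line, col):
        self.cargo = cargo
        self.type = token_type
        self.line = line
        self.col = col

    def __repr__(self):
        return (f"Token({self.type}, {self.cargo!r}, "
                f"line={self.line}, col={self.col})")


class Lexer:
    def __init__(self, source_text):
        self._chars = scan(source_text)
        self._advance()  # prime the one-character lookahead

    def _advance(self):
        self.ch, self.line, self.col = next(self._chars)

    def _collect(self, allowed):
        """Gather consecutive characters that belong to the set `allowed`."""
        cargo = ""
        while self.ch in allowed:
            cargo += self.ch
            self._advance()
        return cargo

    def get(self):
        """Return the next Token; whitespace is skipped, not returned."""
        self._collect(WHITESPACE)
        line, col = self.line, self.col  # position of the token's first char
        if self.ch == EOF:
            return Token(EOF, "EOF", line, col)
        if self.ch in IDENT_START:
            return Token(self._collect(IDENT_CHARS), "IDENTIFIER", line, col)
        if self.ch in string.digits:
            return Token(self._collect(string.digits), "NUMBER", line, col)
        # Anything else is treated as a one-character symbol.
        cargo = self.ch
        self._advance()
        return Token(cargo, "SYMBOL", line, col)


lexer = Lexer("total = total + 42;")
token = lexer.get()
while token.type != "EOF":
    print(token)
    token = lexer.get()
```

The lexer keeps exactly one character of lookahead (self.ch), which is enough to decide where an identifier or number ends without pushing characters back to the scanner.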

Having created the lexer machinery, let's put it through its paces. We will create a driver program that sets up a string of source text, creates a lexer object to tokenize that source text, and then displays the tokens that it gets back from the lexer. [SourceCode]

Here is the core code of the driver program.

Here is the source text that we will pass to the lexer.

When we run the lexer on this source code, the lexer produces the following results. That is, it produces this list of tokens.

Parsing Techniques (first edition, 1990) by Dick Grune and Ceriel Jacobs is a great book about parsing techniques. This book is freely downloadable from http://www.cs.vu.nl/~dick/PTAPG.html.

I learned a lot from the book.

The first thing I tried to do was to implement (in Python) what they consider the best-ever recursive-descent parsing technique. It is explained on pp. 137-140 of their book. Unfortunately, the results were disappointing. I won't dispute their claim that this is a fabulous technique. But I found it clumsy and difficult to understand. I certainly couldn't just sit down and write a parser using the technique.

I had always heard that it wasn't difficult to implement a recursive-descent parser. But that technique was much too difficult (for me at least).

While searching the Web for more information about recursive-descent parsers, I found the Wikipedia article on recursive-descent parsers, complete with EBNF and example in C. And I could see that it is easy to write a recursive-descent parser. As the Wikipedia article puts it: "Notice how closely the [code] mirrors the grammar. There is a procedure for each nonterminal in the grammar."
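The "one procedure per nonterminal" idea can be seen in a very small example. The sketch below is not the grammar from the Wikipedia article and not the nxx grammar; it assumes a hypothetical three-rule expression grammar (shown in the comments) and, as a shortcut, uses a regular expression to tokenize instead of a hand-written lexer. The parse result is a nested tuple of the form (operator, left, right).

```python
import re

# Hypothetical grammar, in EBNF:
#   expression = term { ("+" | "-") term } ;
#   term       = factor { ("*" | "/") factor } ;
#   factor     = NUMBER | IDENTIFIER | "(" expression ")" ;

TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z][A-Za-z0-9_]*|[-+*/()])")


def tokenize(text):
    """Shortcut tokenizer: numbers, identifiers, and six symbols."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def expression(self):
        # expression = term { ("+" | "-") term } ;
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # term = factor { ("*" | "/") factor } ;
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # factor = NUMBER | IDENTIFIER | "(" expression ")" ;
        token = self.peek()
        if token == "(":
            self.expect("(")
            node = self.expression()
            self.expect(")")
            return node
        if token is not None and token[0].isalnum():
            return self.take()
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * x"))    # ('+', '1', ('*', '2', 'x'))
print(parse("(1 + 2) * x"))  # ('*', ('+', '1', '2'), 'x')
```

Each method body reads almost like its grammar rule: a `{ ... }` repetition becomes a while loop, and an alternative becomes an if chain.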

Here is the Python translation of that code.

Many discussions of parsers omit the most important thing: that the purpose of a parser is to produce an abstract syntax tree, or AST. So here is some code for a node in an AST.
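As a rough idea of what such a node could look like, here is a minimal sketch. The class name Node, its label and children attributes, and the add and to_lines methods are assumptions made for this illustration and may differ from the original listing.

```python
class Node:
    """A node in an abstract syntax tree."""

    def __init__(self, label):
        self.label = label   # e.g. a token's text or a nonterminal's name
        self.children = []   # child nodes, in left-to-right order

    def add(self, child):
        """Append a child node and return it, so that calls can be chained."""
        self.children.append(child)
        return child

    def to_lines(self, depth=0):
        """Return the subtree as a list of lines, indented by depth."""
        lines = ["    " * depth + str(self.label)]
        for child in self.children:
            lines.extend(child.to_lines(depth + 1))
        return lines

    def __str__(self):
        return "\n".join(self.to_lines())


# Build by hand the tree that a parser might produce for: x = 1 + 2
assign = Node("=")
assign.add(Node("x"))
plus = assign.add(Node("+"))
plus.add(Node("1"))
plus.add(Node("2"))
print(assign)
```

A parser would call add() from inside its grammar procedures, so the shape of the tree follows the shape of the recursive calls.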

So here is a Python recursive descent parser for nxx.

Here is the code for its driver.

And here is the output of the nxx parser when run on nxx1.txt. This is (as near as I can tell) pretty close to an AST.
