
2021-08-16: Let's write a compiler, part 3: A parser

source link: https://briancallahan.net/blog/20210816.html
Brian Robert Callahan


academic, developer, with an eye towards a brighter techno-social life




All source code for this blog post can be found here.

Tokenizing the source code input is a great start for our compiler. Now let's turn our attention to parsing the PL/0 grammar. Once we have finished with the parser, the front end of our compiler will be finished and we will have a PL/0 validator. This is a program that can read in PL/0 source code and inform the user whether or not it is valid PL/0 code.

There are many strategies available for writing a parser. And lexers too, for that matter. While we wrote a lexer by hand because it's a useful thing to do at least once, you could have used the flex tool instead to generate a lexer for you. Much the same, for a simple grammar like the one found in PL/0, you could use a tool such as yacc to take the PL/0 grammar and output a parser for you.

But since the point of this series is to learn, we will write a hand-crafted recursive-descent parser. Among other features, a recursive-descent parser is a top-down parser. That means we will start with the very first token in the PL/0 source code and continue in order until we do something meaningful with the last token in the PL/0 source code.

Some helper functions

Before we get to writing the parser itself, we are going to need two helper functions, which I am going to name next() and expect(). The next() function is how the parser will control the lexer: it tells the lexer to fetch the next token from the PL/0 source code and record all of the useful information for that token, namely the token type and, if needed, the token string. This is exactly what our lexer is already doing, so we don't need to make any changes to the lexer at all. We just need to hook it up to the parser.

The expect() function is how we will enforce proper syntax. The function takes one parameter, a token type, and checks that the currently recorded token type matches the token type we expect to have at this point in the grammar. If it does not, we report a syntax error through our already existing error mechanism. If it does match, we have the lexer fetch the next token.

The code for these two helper functions might look like this:

/*
 * Parser.
 */

static void
next(void)
{

	type = lex();
	++raw;
}

static void
expect(int match)
{

	if (match != type)
		error("syntax error");

	next();
}

Reading the PL/0 grammar

To write our parser, we will need to be able to read the Extended Backus-Naur Form (EBNF) style grammar that appears as the long grammar comment in our compiler source code. Without going into lots of formal theory or detail, let's keep in mind the following rules for reading the grammar:

  • Everything to the left of an = sign will become a function in our parser.
  • A | means "or." For example, a factor can be an identifier or a number or an expression wrapped in parentheses.
  • Anything wrapped in "double quotes" means literally this thing. So for example the "const" in the block rule means that if you get here you must literally see const as the next token, i.e., a TOK_CONST token.
  • Anything wrapped in [ square brackets ] means you must have exactly zero or one of these. For example, the line in the block rule that begins with [ "const" means that you may have exactly zero or one constant declaration sections in a block.
  • Anything wrapped in { curly braces } means you may have zero or more of these, with no upper limit. In the block rule, you may have as many procedures as you want or no procedures, if that's what you want.
  • Anything wrapped in ( parentheses ) means you must have exactly one of these. This only appears in the condition, expression, and term rules.
  • The bare dot at the end of each rule simply symbolizes the end of the rule and doesn't have any meaning for us.
  • We don't need to worry about what an identifier or a number is. If you remember, the lexer knows how to determine those things and provides us with the token string for all identifiers, reserved words, and numbers. The lexer also validates numbers for us. To repeat: in our version of the PL/0 language, an identifier begins with an A-Za-z_ character and is followed by zero or more A-Za-z0-9_ characters. A number begins with a 0-9 character and contains one or more 0-9 characters with optional _ characters for human readability. The reserved words are all lowercase.
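These reading rules translate mechanically into code: an if for square brackets, a while for curly braces, and a direct match for quoted literals. As a toy illustration (not part of pl0c; tokens here are single characters and a digit stands in for a number), consider a made-up rule decls = [ "c" number ] { "p" number } ";" . :

```c
#include <ctype.h>

/* Toy rule, not part of pl0c:
 *     decls = [ "c" number ] { "p" number } ";" .
 * Tokens are single characters and a digit stands in for a number.
 * Returns 1 if the string matches the rule, 0 otherwise. */
static int
decls(const char *s)
{

	if (*s == 'c') {		/* [ "c" number ]: at most once */
		++s;
		if (!isdigit((unsigned char)*s))
			return 0;
		++s;
	}

	while (*s == 'p') {		/* { "p" number }: zero or more */
		++s;
		if (!isdigit((unsigned char)*s))
			return 0;
		++s;
	}

	return *s == ';';		/* ";": must be present */
}
```

Here "c1p2p3;" and ";" both match, but "c1c2;" does not: the square brackets allow the "c" part at most once.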

Writing our first parser functions

Let's start at the top. The very first rule in our grammar is program. The program rule is defined to be a block rule followed by a literal dot. Let's write that out in code:

static void
parse(void)
{

	next();
	block();
	expect(TOK_DOT);

	if (type != 0)
		error("extra tokens at end of file");
}

OK, so I slightly violated my rule of naming the function after the rule. This will be the only time I do this. This is because parse() was the name of the function that held the lexer printing logic from the previous entry and I've simply replaced that logic with this new logic that begins the compilation process.

Let's see what we have here. The first thing we do is call next(), because if we do not, the lexer will not fetch a token for us and we won't have anything to work with. Next we have what looks extremely similar to the rule itself: a block function followed by us enforcing a literal dot. Because the expect() function fetches the next token automatically, we can also check that no other tokens exist after that literal dot. If there are, that's an error, because you have code after the point at which the grammar says you cannot have any more code. You can have whitespace and comments after the final literal dot, because our lexer does not return on comments, as comments are not tokens. The lexer will instead loop again, reach the end of the input, and return 0, signifying the end of the input. And thus we are OK for this special case of having comments after the final dot.
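The shape of parse() can be seen in miniature in the following sketch (again not part of pl0c: tokens are single characters, 'b' stands in for an entire block, and 0 marks the end of input, mirroring lex() returning 0):

```c
/* Miniature of parse(): consume a "block", require the final dot,
 * then verify nothing but end-of-input remains.  'b' stands in for a
 * whole block and 0 marks end of input.  Returns 0 on success. */
static const char *src;
static int type;

static void
next_tok(void)
{

	type = *src;
	if (type)
		++src;
}

static int
parse_toy(const char *s)
{

	src = s;
	next_tok();		/* prime the first token */

	if (type != 'b')	/* stand-in for block() */
		return -1;
	next_tok();

	if (type != '.')	/* expect(TOK_DOT) */
		return -1;
	next_tok();

	if (type != 0)		/* extra tokens at end of file */
		return -1;

	return 0;
}
```

So parse_toy("b.") succeeds while parse_toy("b.x") fails: the extra token after the dot is caught.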

Now all we have to do is descend into the functions that are as of now undefined. Right now there is just one: block(), which just so happens to correspond to the next rule in the PL/0 grammar, the block rule.

Dealing with optional sections

The block rule has several optional sections. The first is the const section. As it is wrapped in square brackets, that means you can have exactly zero or exactly one const sections. We can turn this into code by following the same rules we did for writing the previous function, except we will wrap it inside an if statement that checks to see if the first token is a TOK_CONST token. If it is, we will execute everything in the block for exactly one const section. If it is not, we will skip the entire block for exactly zero const sections.

This opening part of the block() function might look like this:

static void
block(void)
{

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

You might have already figured out how to deal with the curly brace wrapped parts of the grammar while reading through the code above. Curly braces mean zero or more, with no upper limit. And so handling curly braces is exactly the same as dealing with square brackets, except the if for square brackets becomes a while for curly braces, which allows zero iterations or any number of iterations.

We can follow these rules to finish the block() function:

static void
block(void)
{

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
	}

	while (type == TOK_PROCEDURE) {
		expect(TOK_PROCEDURE);
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);
	}

	statement();
}

Both the const and var sections are in square brackets, while the procedure section is in curly braces. The statement section is not inside anything, so it is not optional. You might also notice that a procedure is little more than a named block, and it recursively calls block() to do its work. This means that you can have infinitely nested procedures as per the grammar. That's a common feature of Pascal-like languages. But I am going to do a little bit of future planning here and forbid nested procedures. You may have as many procedures as you want, they just can't be nested inside each other. If I remember correctly, Pascal compilers enforced a limit on how deep the nested procedures could go (though I'm not sure if any banned nested procedures outright, someone on Hacker News will let us know, I'm sure).

We will not alter the grammar in order to achieve this forbidding of nested procedures, however. I want us to implement a parser for the PL/0 grammar exactly as the grammar is written. We will instead use some metadata to enforce no nested procedures. Let's create a new global variable named depth and we will increment the depth level at the beginning of each block and decrement the depth level at the end of each block. We can use this to check that we don't have nested procedures.

That makes our finished block() function look like this:

static void
block(void)
{

	if (depth++ > 1)
		error("nesting depth exceeded");

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
	}

	while (type == TOK_PROCEDURE) {
		expect(TOK_PROCEDURE);
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);
	}

	statement();

	if (--depth < 0)
		error("nesting depth fell below 0");
}

Additionally, you might notice that while procedures have names, they do not accept any parameters nor do they return any values. They are equivalent to a void function in C that accepts no parameters, i.e., void procedure(void). Keep that in mind when you are writing PL/0 code to give to our PL/0 compiler.

Descending further down the grammar

Are there any functions in the block() function that we have not yet implemented? We are going to keep asking this question until every grammar rule has its own function. The answer is yes, just one: statement(). Lucky us, the statement rule is the next rule in the grammar.

Handling or

The statement rule has a number of possibilities. Each possibility is separated by a |, which means "or." To figure out what to do, we must look at the first token in each possibility. What we are looking for is that every possibility begins with a different token. If that were not the case, we might have to rewrite the affected possibilities until we reached a point where each possibility begins with a different token. We are lucky, or possibly Dr. Wirth did it by design (it was done by design), because each possibility begins with a different token. That lends itself well to a switch statement. Handling | here means a switch statement on the first token with each possibility being its own case.

That might look like this:

static void
statement(void)
{

	switch (type) {
	case TOK_IDENT:
		expect(TOK_IDENT);
		expect(TOK_ASSIGN);
		expression();
		break;
	case TOK_CALL:
		expect(TOK_CALL);
		expect(TOK_IDENT);
		break;
	case TOK_BEGIN:
		expect(TOK_BEGIN);
		statement();
		while (type == TOK_SEMICOLON) {
			expect(TOK_SEMICOLON);
			statement();
		}
		expect(TOK_END);
		break;
	case TOK_IF:
		expect(TOK_IF);
		condition();
		expect(TOK_THEN);
		statement();
		break;
	case TOK_WHILE:
		expect(TOK_WHILE);
		condition();
		expect(TOK_DO);
		statement();
	}
}

All we need to do is follow the logic we already learned implementing the program and block rules to implement the statement rule, with each possibility living in its own case.

Note that there is no default case. This is to leave open the possibility for the empty statement.

Descending even further down the grammar

Let's ask the question again: are there any undefined functions in our parser? Yes, there are two. Both of them were first used in the statement() function: expression() and condition(). And sure enough, there is a grammar rule named expression and a grammar rule named condition. The condition rule is next in the grammar, so let's do that one first.

The condition rule has two possibilities, and, just like the statement rule, each possibility begins with a different token. Since there are only two possibilities, a switch statement is unnecessary. And because the expression rule has many different possible starting tokens, none of which is a TOK_ODD token, we can simply check for a TOK_ODD token; if we don't see one, we must have an expression.

That might look like this:

static void
condition(void)
{

	if (type == TOK_ODD) {
		expect(TOK_ODD);
		expression();
	} else {
		expression();

		switch (type) {
		case TOK_EQUAL:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
			next();
			break;
		default:
			error("invalid conditional");
		}

		expression();
	}
}

No new functions here, just the expression() function that we have yet to implement. So let's implement it. Implementing the expression rule will require writing the term and factor rules as well, which are all the rules we have left. Let's finish the job:

static void
factor(void)
{

	switch (type) {
	case TOK_IDENT:
	case TOK_NUMBER:
		next();
		break;
	case TOK_LPAREN:
		expect(TOK_LPAREN);
		expression();
		expect(TOK_RPAREN);
	}
}

static void
term(void)
{

	factor();

	while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
		next();
		factor();
	}
}

static void
expression(void)
{

	if (type == TOK_PLUS || type == TOK_MINUS)
		next();

	term();

	while (type == TOK_PLUS || type == TOK_MINUS) {
		next();
		term();
	}
}
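It is worth pausing on why this layering works: term() only ever hands expression() fully multiplied-out values, so * and / automatically bind tighter than + and -, with no explicit precedence table. A stand-alone sketch (not part of pl0c; it evaluates a character string directly instead of reading tokens) shows the same expression/term/factor shape computing values:

```c
#include <ctype.h>

/* Stand-alone sketch: evaluate arithmetic over a string using the
 * same expression/term/factor layering as the parser.  Not part of
 * pl0c; it reads characters directly instead of tokens. */
static const char *ep;

static long expr_val(void);

static long
factor_val(void)
{
	long v = 0;

	if (*ep == '(') {			/* "(" expression ")" */
		++ep;
		v = expr_val();
		if (*ep == ')')
			++ep;
	} else {				/* number */
		while (isdigit((unsigned char)*ep))
			v = v * 10 + (*ep++ - '0');
	}

	return v;
}

static long
term_val(void)
{
	long v = factor_val();

	while (*ep == '*' || *ep == '/') {	/* { ( "*" | "/" ) factor } */
		if (*ep++ == '*')
			v *= factor_val();
		else
			v /= factor_val();
	}

	return v;
}

static long
expr_val(void)
{
	int neg = 0;
	long v;

	if (*ep == '+' || *ep == '-')		/* [ "+" | "-" ] */
		neg = (*ep++ == '-');

	v = term_val();
	if (neg)
		v = -v;

	while (*ep == '+' || *ep == '-') {	/* { ( "+" | "-" ) term } */
		if (*ep++ == '+')
			v += term_val();
		else
			v -= term_val();
	}

	return v;
}

static long
eval(const char *s)
{

	ep = s;
	return expr_val();
}
```

With this shape, eval("2+3*4") yields 14 and eval("(2+3)*4") yields 20: the multiplication is absorbed by term_val() before expr_val() ever applies the plus.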

Wrapping up the parser

There are no undefined functions to implement in our code, and all the grammar rules have a corresponding function in our compiler, so our parser is finished. And now with both our lexer and our parser finished, we have a completed front end for our PL/0 compiler. We should be able to use our front end as a validator, as it should accept all valid PL/0 code and reject all invalid PL/0 code.

For completeness, here is all the code we have written thus far:

#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOK_IDENT	'I'
#define TOK_NUMBER	'N'
#define TOK_CONST	'C'
#define TOK_VAR		'V'
#define TOK_PROCEDURE	'P'
#define TOK_CALL	'c'
#define TOK_BEGIN	'B'
#define TOK_END		'E'
#define TOK_IF		'i'
#define TOK_THEN	'T'
#define TOK_WHILE	'W'
#define TOK_DO		'D'
#define TOK_ODD		'O'
#define TOK_DOT		'.'
#define TOK_EQUAL	'='
#define TOK_COMMA	','
#define TOK_SEMICOLON	';'
#define TOK_ASSIGN	':'
#define TOK_HASH	'#'
#define TOK_LESSTHAN	'<'
#define TOK_GREATERTHAN	'>'
#define TOK_PLUS	'+'
#define TOK_MINUS	'-'
#define TOK_MULTIPLY	'*'
#define TOK_DIVIDE	'/'
#define TOK_LPAREN	'('
#define TOK_RPAREN	')'

/*
 * pl0c -- PL/0 compiler.
 *
 * program	= block "." .
 * block	= [ "const" ident "=" number { "," ident "=" number } ";" ]
 *		  [ "var" ident { "," ident } ";" ]
 *		  { "procedure" ident ";" block ";" } statement .
 * statement	= [ ident ":=" expression
 *		  | "call" ident
 *		  | "begin" statement { ";" statement } "end"
 *		  | "if" condition "then" statement
 *		  | "while" condition "do" statement ] .
 * condition	= "odd" expression
 *		| expression ( "=" | "#" | "<" | ">" ) expression .
 * expression	= [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term		= factor { ( "*" | "/" ) factor } .
 * factor	= ident
 *		| number
 *		| "(" expression ")" .
 */

static char *raw, *token;
static int depth, type;
static size_t line = 1;

static void expression(void);

/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
	va_list ap;

	(void) fprintf(stderr, "pl0c: error: %lu: ", line);

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);

	(void) fputc('\n', stderr);

	exit(1);
}

static void
readin(char *file)
{
	int fd;
	struct stat st;

	if (strrchr(file, '.') == NULL)
		error("file must end in '.pl0'");

	if (!!strcmp(strrchr(file, '.'), ".pl0"))
		error("file must end in '.pl0'");

	if ((fd = open(file, O_RDONLY)) == -1)
		error("couldn't open %s", file);

	if (fstat(fd, &st) == -1)
		error("couldn't get file size");

	if ((raw = malloc(st.st_size + 1)) == NULL)
		error("malloc failed");

	if (read(fd, raw, st.st_size) != st.st_size)
		error("couldn't read %s", file);
	raw[st.st_size] = '\0';

	(void) close(fd);
}

/*
 * Lexer.
 */

static void
comment(void)
{
	int ch;

	while ((ch = *raw++) != '}') {
		if (ch == '\0')
			error("unterminated comment");

		if (ch == '\n')
			++line;
	}
}

static int
ident(void)
{
	char *p;
	size_t i, len;

	p = raw;
	while (isalnum(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++)
		token[i] = *p++;
	token[i] = '\0';

	if (!strcmp(token, "const"))
		return TOK_CONST;
	else if (!strcmp(token, "var"))
		return TOK_VAR;
	else if (!strcmp(token, "procedure"))
		return TOK_PROCEDURE;
	else if (!strcmp(token, "call"))
		return TOK_CALL;
	else if (!strcmp(token, "begin"))
		return TOK_BEGIN;
	else if (!strcmp(token, "end"))
		return TOK_END;
	else if (!strcmp(token, "if"))
		return TOK_IF;
	else if (!strcmp(token, "then"))
		return TOK_THEN;
	else if (!strcmp(token, "while"))
		return TOK_WHILE;
	else if (!strcmp(token, "do"))
		return TOK_DO;
	else if (!strcmp(token, "odd"))
		return TOK_ODD;

	return TOK_IDENT;
}

static int
number(void)
{
	const char *errstr;
	char *p;
	size_t i, j = 0, len;

	p = raw;
	while (isdigit(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++) {
		if (isdigit(*p))
			token[j++] = *p;
		++p;
	}
	token[j] = '\0';

	(void) strtonum(token, 0, LONG_MAX, &errstr);
	if (errstr != NULL)
		error("invalid number: %s", token);

	return TOK_NUMBER;
}

static int
lex(void)
{

again:
	/* Skip whitespace.  */
	while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
		if (*raw++ == '\n')
			++line;
	}

	if (isalpha(*raw) || *raw == '_')
		return ident();

	if (isdigit(*raw))
		return number();

	switch (*raw) {
	case '{':
		comment();
		goto again;
	case '.':
	case '=':
	case ',':
	case ';':
	case '#':
	case '<':
	case '>':
	case '+':
	case '-':
	case '*':
	case '/':
	case '(':
	case ')':
		return (*raw);
	case ':':
		if (*++raw != '=')
			error("unknown token: ':%c'", *raw);

		return TOK_ASSIGN;
	case '\0':
		return 0;
	default:
		error("unknown token: '%c'", *raw);
	}

	return 0;
}

/*
 * Parser.
 */

static void
next(void)
{

	type = lex();
	++raw;
}

static void
expect(int match)
{

	if (match != type)
		error("syntax error");

	next();
}

static void
factor(void)
{

	switch (type) {
	case TOK_IDENT:
	case TOK_NUMBER:
		next();
		break;
	case TOK_LPAREN:
		expect(TOK_LPAREN);
		expression();
		expect(TOK_RPAREN);
	}
}

static void
term(void)
{

	factor();

	while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
		next();
		factor();
	}
}

static void
expression(void)
{

	if (type == TOK_PLUS || type == TOK_MINUS)
		next();

	term();

	while (type == TOK_PLUS || type == TOK_MINUS) {
		next();
		term();
	}
}

static void
condition(void)
{

	if (type == TOK_ODD) {
		expect(TOK_ODD);
		expression();
	} else {
		expression();

		switch (type) {
		case TOK_EQUAL:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
			next();
			break;
		default:
			error("invalid conditional");
		}

		expression();
	}
}

static void
statement(void)
{

	switch (type) {
	case TOK_IDENT:
		expect(TOK_IDENT);
		expect(TOK_ASSIGN);
		expression();
		break;
	case TOK_CALL:
		expect(TOK_CALL);
		expect(TOK_IDENT);
		break;
	case TOK_BEGIN:
		expect(TOK_BEGIN);
		statement();
		while (type == TOK_SEMICOLON) {
			expect(TOK_SEMICOLON);
			statement();
		}
		expect(TOK_END);
		break;
	case TOK_IF:
		expect(TOK_IF);
		condition();
		expect(TOK_THEN);
		statement();
		break;
	case TOK_WHILE:
		expect(TOK_WHILE);
		condition();
		expect(TOK_DO);
		statement();
	}
}

static void
block(void)
{

	if (depth++ > 1)
		error("nesting depth exceeded");

	if (type == TOK_CONST) {
		expect(TOK_CONST);
		expect(TOK_IDENT);
		expect(TOK_EQUAL);
		expect(TOK_NUMBER);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
			expect(TOK_EQUAL);
			expect(TOK_NUMBER);
		}
		expect(TOK_SEMICOLON);
	}

	if (type == TOK_VAR) {
		expect(TOK_VAR);
		expect(TOK_IDENT);
		while (type == TOK_COMMA) {
			expect(TOK_COMMA);
			expect(TOK_IDENT);
		}
		expect(TOK_SEMICOLON);
	}

	while (type == TOK_PROCEDURE) {
		expect(TOK_PROCEDURE);
		expect(TOK_IDENT);
		expect(TOK_SEMICOLON);

		block();

		expect(TOK_SEMICOLON);
	}

	statement();

	if (--depth < 0)
		error("nesting depth fell below 0");
}

static void
parse(void)
{

	next();
	block();
	expect(TOK_DOT);

	if (type != 0)
		error("extra tokens at end of file");
}

/*
 * Main.
 */

int
main(int argc, char *argv[])
{
	char *startp;

	if (argc != 2) {
		(void) fputs("usage: pl0c file.pl0\n", stderr);
		exit(1);
	}

	readin(argv[1]);
	startp = raw;

	parse();

	free(startp);

	return 0;
}

On the next episode: Testing

How do we know our front end is correct, anyway? Yes, you could take my word for it. But don't take my word for it. Have test programs that you can use to verify the correctness of the compiler. As we turn our attention to the code generator, we will have to make tweaks to the parser to store metadata about what we are parsing and pass that along to the code generator. It would be good to ensure, at each step along the way, that no errors creep into the parser. And what if we wanted to add new reserved words or fix the unary minus deficiency in the PL/0 grammar? It would be good to know we didn't break anything in those cases.
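A test driver for that might look like the sketch below. Here pl0_check is a toy stand-in that only checks for a final dot; with the real compiler you would run ./pl0c on each test file and compare its exit status against the expected result recorded alongside each test program.

```shell
#!/bin/sh
# Sketch of a test driver.  pl0_check is a toy stand-in that only
# checks for a trailing dot; the real driver would invoke ./pl0c
# and use its exit status instead.
pl0_check() {
	case "$1" in
	*.) return 0 ;;
	*)  return 1 ;;
	esac
}

# check <expected: ok|bad> <program text>
check() {
	if pl0_check "$2"; then got=ok; else got=bad; fi
	if [ "$got" = "$1" ]; then
		echo "PASS"
	else
		echo "FAIL: $2"
	fi
}

check ok  "begin end."
check ok  "var x; x := 1."
check bad "var x; x := 1"
```

Keeping the expected result next to each program means a grammar change that silently starts accepting bad input (or rejecting good input) shows up immediately as a FAIL line.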

