
Interpreters and WTFs - Lexers

source link: https://leondaz.io/interpreters-and-wtfs


Interpreted languages are widely used; Python, PHP, and JavaScript are just a few examples. In this series of articles, we're creating our own interpreted language called wink.

Before we start

This is Part 1 of a new series named Interpreters and WTFs.

Here's a list of the series progress.

  1. Lexers - 15/10/2022

Prerequisites

  • TypeScript knowledge, or JavaScript and knowledge of any typed language
  • Patience, there are lots of details to cover.

Interpreted? What does Google Translate have to do with it anyway

Interpreting is a way to take in an input string whose syntax you have defined and execute it directly. It's called interpreted because the code is read and executed by the language we choose to develop the interpreter in, rather than being compiled to machine code first.
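To make "interpreted" concrete, here's a toy sketch: the host language (TypeScript here) reads the input string and performs the behaviour itself, with no compilation step. `interpretSum` is a made-up example, not part of wink:

```typescript
// A toy "interpreter" for inputs like "3 + 4": the host language reads the
// source text and performs the addition directly.
const interpretSum = (source: string): number => {
  const [left, right] = source
    .split("+")
    .map((part) => parseInt(part.trim(), 10));
  return left + right; // the host language does the actual work
};

console.log(interpretSum("3 + 4")); // 7
```

A real interpreter does the same thing at a much larger scale: it walks the whole program and executes each piece in the host language.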

Assumptions

  • Wink only supports the ASCII charset.
  • Wink does not support floating-point arithmetic, in either single or double precision.
  • Too much abstraction will lead to errors. This is our first interpreter, so it should be simpler, not more performant; because of this, we will have some repetition in the code.
  • Wink will support all common mathematical operators like >=, <=, ==, !=
  • Wink functions will be first-class citizens

We will start by defining the syntax for our language. We then cover what the first part of the interpreter does and see its API in action on this page. After that, we follow TDD to implement that part: once our tests pass, we continue our implementation of the interpreter.

Let's define the syntax of our wink language.

// add.wink

let add = fn (x, y) {
  x + y;  // return implied
};

What we expect after parsing this wink module is a function that adds two numbers x & y and gives back the result.

So you decide to write a specific algorithm to do this. You choose TypeScript as your language.

Your algorithm should:

  • Read the .wink file
  • Read each line
  • Parse each entity depending on the meaning we assumed for it

Let's match each single entity to a specific meaning, ignoring whitespace. The entities are represented as follows, one per line:

let  // variable declaration
add  // function name
=  // assignment operator
fn  // function reserved keyword
(  // opening parenthesis
x  // variable name
,  // comma
y  // variable name
)  // closing parenthesis
{  // opening brace
x  // variable name
+  // plus sign
y  // variable name
}  // closing brace
;  // semi-colon

Each single entity has a meaning denoted in the comment next to it. Let's call each one a token; we're developers, and we like to make everything sound hard, to make studying harder for people actively joining this field.

Let's define a format for our tokens in Typescript.

// lexing.ts

type TTokenType = string // you can make this an enum with all types

interface IToken {
  type: TTokenType  // what you use to distinguish tokens
  literal: string  // what you type in your IDE when writing wink
}

The type tells us which kind of token we're looking at, and the literal is just the token's source text, so a LET token is defined as

{
  type: "LET",
  literal: "let"
}

Note that we could add much more information to the token, such as the line number and the file name; most interpreters actually do, so they can show useful error messages. But I just want to KISS.
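For illustration, a richer token shape might look like the sketch below; `line` and `column` are hypothetical fields, not part of wink's IToken:

```typescript
// A hypothetical richer token shape. The extra fields are NOT part of
// wink's IToken; they illustrate what real interpreters keep around so
// error messages can point at the offending source location.
interface IRichToken {
  type: string;
  literal: string;
  line: number; // 1-based line in the source file
  column: number; // 1-based column where the literal starts
}

const letToken: IRichToken = { type: "LET", literal: "let", line: 1, column: 1 };
```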

Now we've defined our token, let's create something to hold all the token types instead of manually typing it each time.

// lexing.ts

// variable or function names
const IDENTIFIER = "IDENTIFIER"

// reserved keywords
const FUNCTION  = "FUNCTION"
const LET       = "LET"

// mathematical symbols
const PLUS = "PLUS";
const ASSIGN = "ASSIGN";

// symbols
const COMMA = "COMMA";
const SEMICOLON = "SEMICOLON";

// parenthesis, braces
const LPAREN = "LPAREN";
const RPAREN = "RPAREN";
const LBRACE = "LBRACE";
const RBRACE = "RBRACE";

You could use enums, but that would make things a bit more complex, so it's left as an exercise for the reader, as they say.
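For the curious, here's one way that exercise could go, using a TypeScript string enum (a sketch, not the article's actual code). Because each member's value equals the string constant it replaces, the token output stays identical:

```typescript
// A string enum alternative to the individual string constants.
enum TokenType {
  IDENTIFIER = "IDENTIFIER",
  FUNCTION = "FUNCTION",
  LET = "LET",
  PLUS = "PLUS",
  ASSIGN = "ASSIGN",
  COMMA = "COMMA",
  SEMICOLON = "SEMICOLON",
  LPAREN = "LPAREN",
  RPAREN = "RPAREN",
  LBRACE = "LBRACE",
  RBRACE = "RBRACE",
}

// usage looks the same as with the plain constants
const enumToken = { type: TokenType.LET, literal: "let" };
```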

Let's just say that we need to write a function that takes code as input and gives an array of tokens as output

Generalization of the problem

Given a string in the language we're creating

let add = fn (x, y) {
  x + y;
};

We should get this output

[
  { type: LET, literal: "let" },
  { type: IDENTIFIER, literal: "add" },
  { type: ASSIGN, literal: "=" },
  { type: FUNCTION, literal: "fn"},
  { type: LPAREN, literal: "(" },
  { type: IDENTIFIER, literal: "x" },
  { type: COMMA, literal: "," },
  { type: IDENTIFIER, literal: "y" },
  { type: RPAREN, literal: ")" },
  { type: LBRACE, literal: "{" },
  { type: IDENTIFIER, literal: "x" },
  { type: PLUS, literal: "+" },
  { type: IDENTIFIER, literal: "y" },
  { type: RBRACE, literal: "}" },
  { type: SEMICOLON, literal: ";" },
]

This process is called Lexical Analysis or Lexing for short. Lexing is the first part of interpreting a language. Lexers are only responsible for tokenizing the input, it's not the lexer's duty to validate the syntax.

While lexing, you can choose to ignore whitespace characters such as \r, \n, \t and " " like our lexer does above; however, some languages, such as Python, have lexers that do not ignore whitespace, because indentation is significant there.

The lexer constructor we're creating is available on this webpage in the console accessed via a global variable Lexer.

This is the interface that's publicly available on this webpage. Other methods and helpers are hidden for convenience.

interface ILexer {
    // variables
    keywords: Record<string, string>;
    input: string;
    ch: string | null;
    position: number;
    nextPosition: number;

    // methods
    nextToken(): IToken;
    peekChar(): string;
    readChar(): void;
    readIdentifier(): string;
    readDigit(): string;
    skipWhitespace(): void;
}

Let's try playing around with it. Open your console with F12 or Cmd/Ctrl + Opt/Alt + I

$ const input = ";$1213ss let x;{}";
$ lexer = new Lexer(input);

$ lexer.nextToken();
{type: 'SEMICOLON', literal: ';'} // 1

$ lexer.nextToken();
undefined // 2

$ lexer.nextToken();
{type: 'INT', literal: '1213'} // 3

$ lexer.nextToken();
{type: 'IDENTIFIER', literal: 'ss'}

$ lexer.nextToken();
{type: 'LET', literal: 'let'}

$ lexer.nextToken();
{type: 'IDENTIFIER', literal: 'x'}

$ lexer.nextToken();
{type: 'SEMICOLON', literal: ';'}

$ lexer.nextToken();
{type: 'LBRACE', literal: '{'}

$ lexer.nextToken();
{type: 'RBRACE', literal: '}'}

$ lexer.nextToken();
undefined // 4
  1. This string is not valid wink; however, it's not the lexer's job to validate it.
  2. We can't handle the $ character. This is why we're introducing a new token type called ILLEGAL, which shows up whenever we encounter a character we can't process. By the time you scroll past this list, working with the lexer again will show ILLEGAL instead of undefined.
  3. What if the number were a floating-point number? Most languages support IEEE 754 floating-point arithmetic, but remember our assumption: wink doesn't.
  4. Again, we don't have a token for the end of the input, AKA EOF; that's what these weird undefined results were about. Scrolling past this point will let you see EOF in the webpage's lexer.

We will be using this lexer for the rest of the article; some features will unlock only after you scroll to a specific section. This is intended to help you avoid getting overwhelmed with details.

Now let's get comfortable using the lexer before we start our implementation

$ const input = "let author = 'LeOndaz';"
$ const lexer = new Lexer(input);

$ lexer.readIdentifier();
'let'

$ lexer.skipWhitespace();
$ lexer.readIdentifier();
'author'

$ lexer.skipWhitespace();

$ lexer.position // current position in input
11

$ lexer.ch // character at input[lexer.position]
'='

$ lexer.nextPosition // may seem redundant, we will get to that later
12

$ lexer.keywords // you can't use those for variable names
{let: 'LET', fn: 'FUNCTION'}

The interface is simple. Here are some notes:

  • We always read forwards from the current character ch.
  • If the current character is whitespace, skipWhitespace moves past it.
  • We read both let, which is a reserved keyword, and author, which is not, with the same method.
  • We can read numbers using readDigit; the usage is the same as readIdentifier.
  • nextPosition sounds redundant because we could just use position + 1, but it makes things easier when we process symbols that have more than one character, like == for example.
  • nextToken always returns a token, even after all input has been processed.
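The nextPosition point above can be made concrete with a tiny sketch: looking at input[position + 1] without consuming it is exactly what lets the lexer tell == apart from =. `classifyEquals` is a made-up helper for illustration, not part of the Lexer:

```typescript
// When the current character is "=", peeking one position ahead (without
// consuming anything) decides between the two-character "==" token and a
// plain "=" assignment.
const classifyEquals = (input: string, position: number): "EQ" | "ASSIGN" => {
  const nextPosition = position + 1;
  return input[nextPosition] === "=" ? "EQ" : "ASSIGN";
};

console.log(classifyEquals("a == b", 2)); // "EQ"
console.log(classifyEquals("a = b", 2)); // "ASSIGN"
```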

Let's add the missing tokens before we go on

// lexing.ts

// for numbers
const INT = "INT";

// maths
const GT = "GREATER_THAN";
const GTE = "GREATER_THAN_OR_EQUAL";
const LT = "LESS_THAN";
const LTE = "LESS_THAN_OR_EQUAL";
const EQ = "EQUAL";
const NE = "NOT_EQUAL";

const MINUS = "MINUS";
const ASTERISK = "ASTERISK";
const FSLASH = "FSLASH";
const EXCLAMATION = "EXCLAMATION";
const POWER = "POWER";

const ILLEGAL = "ILLEGAL"
const EOF = "EOF"

Notes:

  • We name symbols after their original name, not their functionality.
  • Those constant names are just for you; pick values that let you recognize the token type at a glance.
  • By scrolling past this point, you've unlocked the new features we just added in the webpage's lexer, so test them out.
  • This page has a global function called tokenize that just creates a lexer and extracts all the tokens, instead of you manually calling nextToken many times. Let's try it out
$ const input = "2 ** 3 <= 2 ** 4; 2 != 4;"
$ tokenize(input);
{type: 'INT', literal: '2'}
{type: 'POWER', literal: '**'}
{type: 'INT', literal: '3'}
{type: 'LESS_THAN_OR_EQUAL', literal: '<='}
{type: 'INT', literal: '2'}
{type: 'POWER', literal: '**'}
{type: 'INT', literal: '4'}
{type: 'SEMICOLON', literal: ';'}
{type: 'INT', literal: '2'}
{type: 'NOT_EQUAL', literal: '!='}
{type: 'INT', literal: '4'}
{type: 'SEMICOLON', literal: ';'}
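A helper like tokenize could look something like the sketch below: keep calling nextToken until EOF. `tokenizeAll` and `fakeLexer` are invented names; the page's real tokenize implementation isn't shown in the article:

```typescript
interface Token {
  type: string;
  literal: string;
}

// Collect tokens by calling nextToken until the EOF token appears.
const tokenizeAll = (lexer: { nextToken(): Token }): Token[] => {
  const tokens: Token[] = [];
  let token = lexer.nextToken();

  while (token.type !== "EOF") {
    tokens.push(token);
    token = lexer.nextToken();
  }

  return tokens;
};

// Stand-in lexer that replays a fixed token stream, for demonstration.
const stream: Token[] = [
  { type: "INT", literal: "2" },
  { type: "SEMICOLON", literal: ";" },
  { type: "EOF", literal: "" },
];
let i = 0;
const fakeLexer = { nextToken: () => stream[i++] };

const result = tokenizeAll(fakeLexer);
console.log(result.length); // 2
```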

By now you should understand what this lexer does, so let's dive a bit deeper.

  • Our lexer starts at input[0]; we set lexer.ch to input[0].
  • readChar increments position and nextPosition by one.
  • peekChar returns input[nextPosition] without incrementing the position variables.
  • nextToken checks the current saved character:
  • If it's a letter, we assume it's an identifier of some sort, either a variable name or a reserved keyword. We save the current position and look at the next character; if it's a letter, we keep reading until we encounter a non-letter character. We then take input[savedPosition:currentPosition], which yields the identifier we just read; all that's left is to check whether it's a reserved keyword.
  • If it's a digit, we assume it's the beginning of a number; we save the position, read until we encounter a non-numeric character, and take input[savedPosition:currentPosition].
  • If it's a one-character symbol like the semicolon, we just return the token associated with semicolons.
  • If it's a character like = that could start a two-character symbol, we have to peek further to see whether it's followed by another = and return either the EQ token or the ASSIGN token.
  • The same logic applies to any two-character token like >=, <=, !=.
  • If we are at the end, we set lexer.ch to null.

Now you should have a good idea about what it does and how it works, let's define the allowed characters, we mentioned that we check if a character is a letter or a digit, so we're going to need some helpers for this.

// lexing.ts

// s is a string of length 1
const isLetter = (s: string) => {
    return (
        (s.charCodeAt(0) >= 65 && s.charCodeAt(0) <= 90)
        ||
        (s.charCodeAt(0) >= 97 && s.charCodeAt(0) <= 122)
        ||
        s === "_"
    );
}

// s is a string of length 1
const isDigit = (s: string) => {
    return s >= "0" && s <= "9";  // !!parseInt(s) would wrongly reject "0"
}

If you want to extend the charset that wink supports, this is the place to look; I've allowed the underscore character, as you can see. If you wanted, for example, to allow Roman numerals or the Hindu–Arabic numeral system, this is also the place. JavaScript allows some weird symbols in identifier names, like $, a symbol later adopted by jQuery; the $ in jQuery is just a function name.
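readIdentifier and readDigit, which we'll meet below, are both instances of the same "read while a predicate holds" idea built on these character tests. Here's a standalone sketch of that idea; `readWhile`, `isLetterCh` and `isDigitCh` are invented names, not part of the article's Lexer:

```typescript
// Starting at `start`, advance while the predicate matches, then slice out
// the run of characters that was read.
const readWhile = (
  input: string,
  start: number,
  predicate: (s: string) => boolean,
): string => {
  let end = start;

  while (end < input.length && predicate(input[end])) {
    end++;
  }

  return input.slice(start, end);
};

// Predicates equivalent to the isLetter/isDigit helpers above.
const isLetterCh = (s: string) =>
  (s >= "A" && s <= "Z") || (s >= "a" && s <= "z") || s === "_";
const isDigitCh = (s: string) => s >= "0" && s <= "9";

console.log(readWhile("let x", 0, isLetterCh)); // "let"
console.log(readWhile("1213ss", 0, isDigitCh)); // "1213"
```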

Creating tests

Let's start by adding tests that we want our implementation of lexers to pass.

// tests/lexing_test.ts
test("Lexer should be able to parse wink's syntax correctly without errors", () => {
  expect(() => {
    const input: string = `
                let multiply = fn (x, y) {
                    x * y;
                };

                3 >= 2;
                2 != multiply;
          `

    const tests: IToken[] = [
      // multiply function
      {type: LET, literal: "let"},
      {type: IDENTIFIER, literal: "multiply"},
      {type: ASSIGN, literal: "="},
      {type: FUNCTION, literal: "fn"},
      {type: LPAREN, literal: "("},
      {type: IDENTIFIER, literal: "x"},
      {type: COMMA, literal: ","},
      {type: IDENTIFIER, literal: "y"},
      {type: RPAREN, literal: ")"},
      {type: LBRACE, literal: "{"},
      {type: IDENTIFIER, literal: "x"},
      {type: ASTERISK, literal: "*"},
      {type: IDENTIFIER, literal: "y"},
      {type: SEMICOLON, literal: ";"},
      {type: RBRACE, literal: "}"},
      {type: SEMICOLON, literal: ";"},

      // 3 >= 2
      {type: INT, literal: "3"},
      {type: GTE, literal: ">="},
      {type: INT, literal: "2"},
      {type: SEMICOLON, literal: ";"},

      // 2 != multiply
      {type: INT, literal: "2"},
      {type: NE, literal: "!="},
      {type: IDENTIFIER, literal: "multiply"},
      {type: SEMICOLON, literal: ";"},
    ]

    const lexer = new Lexer(input);

    for (const test of tests) {
      const token = lexer.nextToken();

      if (token.type != test.type) {
        throw new Error(`Expected type ${test.type} but got type ${token.type}`);
      }

      if (token.literal != test.literal) {
        throw new Error(`Expected literal ${test.literal} but got literal ${token.literal}`)
      }
    }
  }).not.toThrowError();
})

This test is our target: once it passes, our lexer can tokenize wink. Now we're ready to start working on the implementation itself.

Let's create a class that implements ILexer

// lexing.ts

class Lexer implements ILexer {
  keywords: Record<string, string> = {
      "let": LET,
      "fn": FUNCTION,
  }

  // provide sensible defaults
  input: string;
  ch: string | null = null;
  position: number = 0;
  nextPosition: number = 1;

  constructor(input: string) {
    this.input = input
    this.ch = input[this.position];
  }

  // stubs; we'll implement each of these below
  readChar(): void {}
  nextToken(): IToken { throw new Error("not implemented"); }
  peekChar(): string { throw new Error("not implemented"); }
  readIdentifier(): string { throw new Error("not implemented"); }
  readDigit(): string { throw new Error("not implemented"); }
  skipWhitespace(): void {}
}

Let's implement readChar first, as everything is built on top of it

// lexing.ts

readChar() {
  if (this.nextPosition >= this.input.length) {  // 1
      this.ch = null;
  } else {
      this.ch = this.input[this.nextPosition]; // 2
  }

  this.position++;  // 3
  this.nextPosition++;  // 4
}
  1. If nextPosition is past the end of the input, we've hit EOF, so the current character becomes null.
  2. If not, set this.ch to the next character in the input.
  3. Increment the position by one
  4. Increment the next position by one as well
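peekChar and skipWhitespace aren't implemented in the article, so here's a minimal sketch of how they might look, consistent with readChar above. `MiniLexer` is a cut-down, invented stand-in with the same fields, just so the snippet is self-contained:

```typescript
class MiniLexer {
  input: string;
  ch: string | null;
  position = 0;
  nextPosition = 1;

  constructor(input: string) {
    this.input = input;
    this.ch = input[0] ?? null;
  }

  readChar(): void {
    this.ch =
      this.nextPosition >= this.input.length
        ? null
        : this.input[this.nextPosition];
    this.position++;
    this.nextPosition++;
  }

  // Look at the upcoming character without consuming anything.
  peekChar(): string {
    return this.input[this.nextPosition] ?? "";
  }

  // Advance past spaces, tabs and newlines.
  skipWhitespace(): void {
    while (this.ch !== null && " \t\n\r".includes(this.ch)) {
      this.readChar();
    }
  }
}

const lex = new MiniLexer("  =x");
lex.skipWhitespace();
console.log(lex.ch); // "="
console.log(lex.peekChar()); // "x"
```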

Now let's implement nextToken

// lexing.ts

nextToken() {
  this.skipWhitespace();  // 1
  let t: IToken | undefined = undefined;  // 2

  if (this.ch) {  // 3
    if (isDigit(this.ch)) {
      return {  // 4
        type: INT,
        literal: this.readDigit(), // 5
      }
    } else if (isLetter(this.ch)) {
        const identifier = this.readIdentifier(); // 6

        if (this.keywords[identifier]) { // 7
          return { // 8
            type: this.keywords[identifier],
            literal: identifier,
          }
        } else {
          return {
            type: IDENTIFIER, // 9
            literal: identifier,
          }
        }
    } else if (this.ch === ";" ){ // 10
        t = {
          type: SEMICOLON,
          literal: ";"
        }
      } else if (this.ch === "=") {
        t = {
          type: ASSIGN,
          literal: "="
        }
      } else if (this.ch === "+") {
        t = {
          type: PLUS,
          literal: "+"
        }
      }
  }

  this.readChar();  // 11
  return t;  // 12
}

This is the first subset we mentioned, without the EOF & ILLEGAL tokens

  1. If the lexer.ch is a whitespace character, increment the positions.
  2. A storage for the resulting token.
  3. If there's a lexer.ch, otherwise, it's EOF (we didn't handle this yet).
  4. Notice the return, we didn't use t here, the reason is an implementation detail in readDigit.
  5. readDigit reads the current digit to the end, it calls readChar repeatedly and this is why we're returning, because nextToken calls readChar at the end, we don't want to call it again after readDigit.
  6. readIdentifier is the same as readDigit but with a different predicate function (You guessed, you can generalize it).
  7. If the identifier is a reserved keyword.
  8. Return its type from the keywords map, which maps KEYWORD to TYPE; again, returning and not using t.
  9. The type of a non-reserved keyword is IDENTIFIER.
  10. If it's neither a letter nor a digit, it must be a symbol, if it's a semi-colon, set t to semi-colon tokens, here we are using t.
  11. Increment the current position variables, so that the next nextToken call works on a different character. This is why we used t: we need execution to continue to this point, so we didn't return early.
  12. In this first version, t can still be undefined if we hit a character we don't handle (like $); the extended version below fixes that.

Let's extend this to handle EOF and ILLEGAL as well.

// lexing.ts

nextToken() {
  this.skipWhitespace();
  let t: IToken | undefined = undefined;

  if (this.ch) {
    if (isDigit(this.ch)) {
      return {
        type: INT,
        literal: this.readDigit(),
      }
    } else if (isLetter(this.ch)) {
      const identifier = this.readIdentifier();

      if (this.keywords[identifier]) {
        return {
          type: this.keywords[identifier],
          literal: identifier,
        }
      } else {
        return {
          type: IDENTIFIER,
          literal: identifier,
        }
      }
    } else if (this.ch === ";") {
      t = {
        type: SEMICOLON,
        literal: ";"
      }
    } else if (this.ch === "=") {
      t = {
        type: ASSIGN,
        literal: "="
      }
    } else if (this.ch === "+") {
      t = {
        type: PLUS,
        literal: "+"
      }
    } else { // 1
      t = {
        type: ILLEGAL,
        literal: this.ch
      }
    }
  } else {  // 2
    t = {
      type: EOF,
      literal: ""
    }
  }

  this.readChar();
  return t as IToken;
}

  1. If we don't know the character, it's ILLEGAL
  2. If the lexer.ch is null, it's EOF

Seems easy? Let's add the maths and boolean operators as well

// lexing.ts

nextToken(): IToken {
  this.skipWhitespace();

  let t: IToken | undefined = undefined;

  if (this.ch) {
    if (isDigit(this.ch)) {
        return {
          type: INT,
          literal: this.readDigit()
        }
    } else if (isLetter(this.ch)) {
        const identifier = this.readIdentifier();

        if (this.keywords[identifier]) {
            return {
              type: this.keywords[identifier],
              literal: identifier
          }
        } else {
            return {
              type: IDENTIFIER,
              literal: identifier,
          }
        }
      } else if (this.ch === ";") {
          t = {
            type: SEMICOLON,
            literal: ";"
        }
      } else if (this.ch === "{") {
          t = {
            type: LBRACE,
            literal: "{"
        }
      } else if (this.ch === "}") {
          t = {
            type: RBRACE,
            literal: "}"
          }
      } else if (this.ch === "(") {
          t = {
            type: LPAREN,
            literal: "("
          }
      } else if (this.ch === ")") {
          t = {
            type: RPAREN,
            literal: ")"
          }
      } else if (this.ch === "=") {  // 1
          const nextChar = this.peekChar();  // 2

          if (nextChar === "=") {  // 3
            this.readChar(); // 4

            t = {
                type: EQ, // 5
                literal: "=="
            }
          } else {
              t = {
                type: ASSIGN, // 6
                literal: "="
              }
          }
      } else if (this.ch === "+") {
          t = {
            type: PLUS,
            literal: "+"
          }
      } else if (this.ch === ",") {
          t = {
            type: COMMA,
            literal: ","
          }
      } else if (this.ch === "-") {
          t = {
            type: MINUS,
            literal: "-"
          }
      } else if (this.ch === "*") {
          const nextChar = this.peekChar();

          if (nextChar === "*") {
            this.readChar();
            t = {
                type: POWER,
                literal: "**"
            }
          } else {
              t = {
                type: ASTERISK,
                literal: "*"
              }
          }
      } else if (this.ch === "!") {
          const nextChar = this.peekChar();

          if (nextChar === "=") {
            this.readChar();

            t = {
                type: NE,
                literal: "!="
              }
          } else {
              t = {
                type: EXCLAMATION,
                literal: "!"
              }
          }
      } else if (this.ch === ">") {
          const nextChar = this.peekChar();

          if (nextChar === "=") {
            this.readChar();

            t = {
                type: GTE,
                literal: ">="
              }
          } else {
              t = {
                type: GT,
                literal: ">"
              }
          }
      } else if (this.ch === "<") {
          const nextChar = this.peekChar();

          if (nextChar === "=") {
            this.readChar();

            t = {
                type: LTE,
                literal: "<="
              }
          } else {
              t = {
                type: LT,
                literal: "<"
              }
          }
      } else {
          t = {
            type: ILLEGAL,
            literal: this.ch,
          }
      }
  } else {
      t = {
        type: EOF,
        literal: ""
      }
  }

  this.readChar();
  return t as IToken;
}
  1. If the character is equal sign.
  2. Get a quick sneak-peek.
  3. If the next sign is another equal sign.
  4. Increment the position, read the next character as we already know it's an equal sign.
  5. Set t as EQ operator.
  6. If the next character is not an equal sign, it's just a normal assign operator.
  7. Same applies to any two symbol operator like >=, <=, **, !=.

Recap

We have created a piece of software that understands the language syntax we defined ourselves. This piece of software is called a lexer, and a lexer produces objects called tokens. Tokens carry information about the underlying code, anywhere from just the token type to where the literal lies in the file. We've also created a test to ensure our lexer doesn't break if we decide to upgrade it at some point.

In the next article, we're going to work with creating a Parser for our language.

The parser takes in the output of the lexer as an input and outputs an AST.

You can think of the AST as a big JSON object that clearly represents the input code. We will cover it in great detail in the next couple of articles.
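As a rough preview, an AST node for `x + y;` might look something like this; the node names here are invented for illustration, and the real structure is covered in the next article:

```typescript
// Purely illustrative AST sketch for the statement `x + y;`.
const astSketch = {
  type: "ExpressionStatement",
  expression: {
    type: "InfixExpression",
    operator: "+",
    left: { type: "Identifier", value: "x" },
    right: { type: "Identifier", value: "y" },
  },
};

console.log(astSketch.expression.operator); // "+"
```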

Repository

You can find all the code associated with this series in LeOndaz/the-wink-language

Support

You can buy me a coffee if you're interested.

