Building a Programming Language in Twenty-Four Hours

Published: [ 2023-06-05 08:45 ]
Categories: [ programming ]
Tags: [ rust, plt ]

I know I promised to put out an article every week on a Sunday. I haven't been too good at sticking to that schedule. It's now Sunday morning at around 8:50 AM. I have a thing I have to be at by 4 PM. Naturally, I'm going to pick something that will take a lot more than seven hours.

Enter: Building a programming language from scratch in six hours (or less).

Hey, future Ersei here. This took more than six hours. Just FYI.

Let's see how this goes.

Discuss this post on Mastodon

This should not be used as a tutorial for how to create a programming language. It is my actual thought process and my failures while making one. Everything that I write is wrong, and if you copy it down then what you make will be wrong too (until you get to the part where I realize it's wrong and fix it). You have been warned.

My friend told me that I am not a real programmer because I have not made my own programming language yet. Because I can't take a joke, I decided to take her up on the challenge and make one. There were a few small issues with that, though.

I have no experience building a programming language.

Literally none.

But I know how to start projects! Set some goals!

This programming language must be able to:

Be fast
Be able to call external C code
Look nice (kinda)
Portable
(Time permitting) have data structures like a dynamically allocated list

Hey, future Ersei here. About two of these got implemented. Just FYI.

Let's do what every project starts with: searching the internet.

There are a lot of options! People have made their own hobby languages already! I opened up all of the links on the first page up in another tab.

The first link I clicked on gave me this:

Over the past 6 months, I've been working on a programming language called Pinecone. I wouldn't call it mature yet, but it already has enough features working to be usable, such as variables, functions and user defined structures (freeCodeCamp)

Uh, six months? I need to get all of that done in six hours (or less.)

I've also just spent the last ten minutes writing all of that and putting in screenshots!

Anyway, there seem to be a few steps in making a programming language: first is picking the language that your language is built in. Let's go for Rust for no other reason than for the meme.

Next up is lexing (which is to take the programming language text and turn it into a set of "tokens" for the compiler to look at). The first link also told me that there are existing lexers that'll do it all for you. Let's search for a lexer in Rust. The first link is pest. Might as well use it.

I am also now realizing that I need a name for my Git repo where I'm gonna be putting all of this. Let's go with... Ersei-Lang, shortened to Erlang! No, that won't work. EPL? For Ersei-Programming-Language? Apparently it already exists for programming printers. Uh, how about picking a random word from the dictionary? Let's go with...

A random word generator returning the word

Cane! Good enough for me. A cursory search returns no other project called "Cane", so I think I'm good.

Wow, I did really just spend ten more minutes picking a name.

Anyway, let's get started! Sit down behind the computer, open up Vim, and get typing!

Introducing the Cane Programming Language

If you really want to read the whole development story, go to the Development Hell section.

To download the language, go to the Sourcehut page.

To mess with the language online, go to the web playground.

Without further ado, here's some code snippets!

; Simple "Hello, world!" program ~

(`hello, ' `world\n' add print)

;
It's pretty simple to define a variable
You need: the data, and a string as a name.
~

((1 2 3 4 add) `myvarname' defvar)
(myvarname `\n' print)
(myvarname `\n' print)
(myvarname `\n' print)
(myvarname `\n' print)

; All variables are immutable! You can replace the variable, but you can't edit it. ~

((1 2 3 4 add) `myvarname' defvar)
(myvarname `\n' print)
((1 2 3 4 5 add) `myvarname' defvar)
(myvarname `\n' print)

Pretty cool, huh?

How about something more dynamic? Make a function (I call them "actions" because why not)

;
actions, or functions, are just strings that are executed. arguments passed are available
through the `args' variable
~

([(args add)] `myaction' defact) ; define an action called `myaction' ~
(1 2 3 4 myaction print) ; call myaction with arguments 1 2 3 4 ~

Now print the numbers from 1 to 100!

; Prints the numbers from 1 to 100 ~

(1 `count' defvar)
(
    [
        (count `\n' print)
        (count 1 add `count' defvar)
    ]
    100 repeat
)

There are a lot more demos on the Git repo.

The TL;DR

Cane is a weird language. It's like if you showed me Lisp and asked me to go crazy. It's got parentheses!

This is also technically a functional programming language. The whole program will evaluate to a state. Again, I have no idea how functional programming languages work. The most experience I have is with Nix and even then I just copy-paste things into my config until it works.

I am probably the worst person to be making a Lisp-like functional-ish programming language. Mostly because I know just enough to be dangerous. And not enough to know any better. Additionally, I don't know enough Rust to make this efficient, which is why there might be some, uh, idiosyncrasies.

But! I'm a real programmer now 😎

This was certainly a journey, and I might come back in a few days/weeks/months to add a feature or three. But this was just a proof-of-concept, to prove that I can and to do it all without cheating (using external libraries/LLVM/looking at someone else's code). My journey consists of novice mistakes, rediscoveries of old programming paradigms (reverse polish notation ftw!), and mostly a lot of fun. Over the course of the past two months (I had final exams and then I took a break from computers so I could destress), this has been my project. I fell asleep thinking about this. I woke up with new ideas. The last three days especially have been so fun!

I recommend doing something like this yourself. Take on a project that can be as simple or as complicated or as easy or as hard as you want. Like building a programming language. Give yourself a time limit. See what happens. No stress like with hackathons or homeworks. Just time yourself. Estimate if you don't really care. Give it a shot. You'll have fun, I promise.

And then email me to show me what you've made ❤️

Development Hell

None of this actually made it to the end project. Go down to the Designing the New Language section to see something more similar to the end result.

$ cargo new cane-lang

Let's also add in the lexer dependency into our Cargo.toml.

$ cargo add pest

Now let's waste precious time for Cargo to update the Crates index.

And while we're at it, let's update our Neovim configuration and hope it doesn't break. You can tell how I put off writing this until the last minute right?

Anyway, now what? We have a Git repository, we have a lexer installed, but what do we do after that?

Now we have to take our tokens and turn it into an abstract syntax tree. If you don't know what that means, it's fine, neither do I. I think it's like, the map of your code as a tree or something. I don't know, I just read the first line in the Wikipedia article.

Oh good, someone has done a lot of the hard work for me

Let's do all of that.

I am also now realizing that my Rust skills are… rusty. I forget how to do everything. Maybe Rust wasn't the best language to pick, but here I am.

$ rm -rf cane-lang
$ cargo new cane-lang --lib
$ cd cane-lang
$ mkdir src/bin

Now that we have some structure in our project, I can get started on actually doing work! Let's figure out what our programming language will look like (spoiler alert: it's gonna look like C). Yes, there will be semicolons and you can't stop me.

Looking at that theforce programming language, it seems like I can just… copy the code and change up the tokens in the lexer and the test and I will be done? See? Programming language in six hours (or less). It's more like five hours now because I've spent so much time just researching and looking at documentation and I have zero idea what's going on.

Anyway, let's use that programming language as a help card, to draw inspiration from in the case of being stuck.

Let's poke around that repo and see what's going on. It's a very simple set of code, which is super nice.

Now let's get into programming language design. I'm learning as I write this, so it might not be super accurate, but let's do this.

Introduction to ASTs

An AST is a representation of the code in a "tree" format. The tree is made up of "nodes". One node is an "if" statement, or declaring a string, or to call a function, or assigning a variable. Some nodes can be operations. There are two kinds of operations: unary and binary. Unary operations act on one node and one node only. Think of the "not" operator. It will invert one node. Binary operations act on two nodes, and thus can be recursive since they can act on themselves. Think of adding two nodes together.

Now, we have to take the program file and make it into an AST. We do that with a grammar.

Introduction to Grammar

Grammar represents how the programming language should look. That's it. That's what pest does. It defines a grammar for your language and you can use it to generate a lexer and AST. Let's take a look at what a part of a .pest file looks like (taken from theforce):

Statements = _{ Statement* }

Statement = _{
    DeclareBooleanStatement
    | DeclareFloatStatement
    | DeclareStringStatement
    | AssignStatement
    | AssignFromFunctionStatement
    | PrintStatement
    | ReadFloatStatement
    | ReadStringStatement
    | ReadBooleanStatement
    | ForStatement
    | WhileStatement
    | IfStatement
    | CallFunctionStatement
    | Noop
}

I should probably read the pest documentation while I'm at it.

Anyway, let's put together a basic grammar for Cane. I'm gonna copy the boilerplate from theforce.

Program = _{ SOI ~ Functions ~ EOI }

Functions = { Function* }

Function = _{ VoidFunction | NonVoidFunction | Main }

VoidFunction = {
    DeclareFunction ~ FunctionName
    ~ Parameters
    ~ Void
    ~ Statements
    ~ EndFunctionDeclaration
}

NonVoidFunction = {
    DeclareFunction ~ FunctionName
    ~ Parameters
    ~ Statements
    ~ ReturnStatement
    ~ EndFunctionDeclaration
}

Parameters = { (FunctionParameters ~ VariableName)* }

Main = {
    BeginMain
    ~ Statements
    ~ EndMain }

Statements = _{ Statement* }

Now I have to find out what statements I want to declare and use. theforce uses the following statements:

DeclareBooleanStatement
DeclareFloatStatement
DeclareStringStatement
AssignStatement
AssignFromFunctionStatement
PrintStatement
ReadFloatStatement
ReadStringStatement
ReadBooleanStatement
ForStatement
WhileStatement
IfStatement
CallFunctionStatement
Noop

Of these, I don't want to have multiple Declare statements. I wish to have one strongly-typed declaration statement. Likewise, I don't want to have multiple Read statements for all of the supported types.

So, I add the following to my grammar:

Statement = _{
    DeclareStatement
    | AssignStatement
    | AssignFromFunctionStatement
    | PrintStatement
    | ReadStatement
    | ForStatement
    | WhileStatement
    | IfStatement
    | CallFunctionStatement
    | Noop
}

Now for the hard part: implementing all of these statements. Let's figure out how to implement types.

ooOOoooo Time Skip ooOOoOOoOo

Ok, it's been like an hour of researching pest grammar. To be honest, I don't understand it. Maybe a strongly-typed language isn't a "six hour" project. At the time of writing this sentence, I have exactly three hours left. Let's rethink how types ought to be implemented in this program.

There should be a set of "atomic" types—like integer (split into unsigned int, signed int, unsigned long, and signed long), float (and double), and character. These are the smallest set of data that the language can handle. On top of these "atomic" types, there ought to be types that can combine the atomic types (and other types), like arrays. I don't want to do anything beyond that, as that'll add too much complexity. Being able to use arrays would make this language actually useful! Would'ya look at that?

Enough talk, let's get into it. Instead of just a generic DeclareStatement and AssignStatement, I'm going to use multiple: DeclareAtomicStatement and DeclareArrayStatement (and related).

So, I need something like the following:

DeclareAtomicStatement = {
    DeclareAtomic ~ VariableName ~ TypePickerValue
    ~ PUSH(Types) ~ SetInitialValue ~ ( POP | VariableName )
}

VariableName = { Identifier }

FunctionName = { Identifier }

Identifier = @{ ASCII_ALPHA ~ (ASCII_ALPHANUMERIC)* }

SetInitialValue = _{ "=" }
TypePickerValue = _{ ":" }

ooOooOoOoooo Time Skip Again OoOOOoooO

So, I had a conversation with someone smarter than I am about making my own programming language. To have arrays, I need types. Type theory is hard.

So, I'm starting over. Again. Let's redefine our goals. Our old goals looked something like this:

Be fast
Be able to call external C code
Look nice (kinda)
Portable
(Time permitting) have data structures like a dynamically allocated list

Let's change that up a little:

Portable
Have lists

Hey, future Ersei here. Those are actually what ended up happening.

Now, that brings me to my first "aha" moment: I can make a functional programming language! Maybe something like Scheme or make my own Lisp dialect…

Big issue: I don't know how to program in functional languages. Thus, I do not understand them. Keeping in line with my "six hour" time frame (of which I have like, two left), this might be difficult.

First, let's give myself a little crash course on Lisp-like languages (non-comprehensive, done in five minutes):

It's got lots of parentheses
It uses a lot of singly-linked lists
Very recursive

That's as far as I want to get with understanding Lisp. Let's get right into designing the language:

Designing the New Language

Most of this got changed up. Again. But we are a lot closer to what the final language will look like.

We can worry about how the language will be implemented later (but we're keeping it in mind so that it isn't too difficult later down the road).

Each program is a statement. Each statement can have more than one statement in it. This allows for recursion.

Each statement is of the following syntax: either (action data) or (data). Data is of the following types: a statement, list, or atomic (string, int, float).

All numeric values are infinite precision.

Let's define some keywords:

Hi, future Ersei here. A lot of these have changed. This is not what the language will have in the end.

add A...: adds together all of the items in the data recursively. If any value is a String, then the output will begin to add every additional item as a string.
mult A...: multiplies all of items in the data. If lists are passed, then matrix multiplication occurs. Two strings being multiplied must result in an error. A string multiplied by an integer n will return that string repeated n times.
div A...: divides each value from the next in the data. If non-numeric values are passed, then the interpreter must halt.
dup N A...: returns a list containing N of the same inputs. If more than one item is passed, then the items are put into a list and the list is duplicated.
defact NAME A...: creates an action with name NAME. Will evaluate A in order when called. If more than one A is passed, then the function returns each output in a list. Names must be unique between actions and variables.
free NAME...: frees all variables and actions with the following names, allowing that name to be reused. A variable that is not found will halt the interpreter.
print A...: prints data.
write FILE: writes a string to a file (clobber).
read FILE: reads a file into a list of lines.
ext FILE: import and execute another file.
if CONDITION THEN ELSE: if the condition is non-zero when cast to a number, return the data in THEN. Otherwise, return the data in ELSE.
get RANGE: where range is a set of integers, return those values from the list (inclusive). Zero-indexed.
eq A...: checks if the values are equal. Casts to the first argument. Will return a Number0 if they are not equal. Will return 1 if they are equal.
eqq A... checks if the values and types are equal. Will return a Number0 if they are not equal. Will return 1 if they are equal.
gt A...: checks if the values are in descending order. Casts to the first argument. Will return a Number0 if they are not in order. Will return 1 if they are in order.
gtt A... checks if the values are in descending order. Types must all match. Will return a Number0 if they are not in order or if types do not match. Will return 1 if they are in order.
lt A...: checks if the values are in ascending order. Casts to the first argument. Will return a Number0 if they are not in order. Will return 1 if they are in order.
ltt A... checks if the values are in ascending order. Types must all match. Will return a Number0 if they are not in order or if types do not match. Will return 1 if they are in order.

There will be a few atomic types as well:

String: one or more chars. Empty string is considered 0.
Number: infinite precision.
Action: refers to a set of instructions. If an Action is in a List, it will be ignored.

There is one composition type:

List: a dynamically allocated array of items (including other lists). All lists are zero-indexed.

There will also be a few reserved values:

args: a list of the arguments passed to the Action.

All types will be implicitly cast to the other if needed:

Number to list: 1 will convert to (1)
String to list: "apple" will convert to ("a", "p", "p", "l", "e")
List to number: always will return the size of the list.
List to string: concatenates all items in list to a string, ie ("a", "p", "p", "l", "e") will turn into "apple". This is recursive.
Number to string: 1 will convert to "1". Any floating point numbers will be represented as rational fractions.
String to number: all numerics will be converted into strings and all non-numerics will halt execution. "1" will convert to 1, and "a" will stop the program. Expects rational fraction.

Actions cannot be cast. If attempted, the program must terminate.

The program will continue to execute until an unrecoverable error has occured.

Unrecoverable errors:

Cast to and from Action
Reached end of file and parentheses are not all matched
Undefined Action or variable
Undefined value
Same Action or variable is defined multiple times
Cast string to number when it is not possible to do so
Unexpected token (two types passed instead of a list containing both items)
The empty list is used ()
Two strings are multiplied together
Non-numbers are divided

Comments all start with =. The interpreter must ignore all input until a = is reached. The exception to this is when the comment delimiters are in a string.

The Hello, World program would look something like this:

(print "hello, world")

Alternatively, it could also be:

(print "hello, " "world")

(print (add ("hello") ("world")))

I can figure out more examples later. Now that we have the (maybe broken) spec let's get to programming!

One hour remains. I might end up needing more time.

Designing The Interpreter

This will be an interpreted language. This interpreter will memory-map the file (not required to do so) and read it character by character. As such, the file should not edited while it is being executed.

The interpreter will keep track of the state of the program:

A list of all Actions as a hashmap of the action name and the location in memory where it is kept.
The call stack. This will be implemented as a linked list that grows backward. Every time that the program "jumps", the previous location is added to the stack. When the "jump" is finished (the last parenthesis is closed), the program pops the call stack and continues from that location. If the stack is empty, the program exits.
The data list. This is a dynamic list that contains the current execution state of the program. The list must be comprised of either lists or atomics. As the program progresses, the list is recursively flattened from the bottom-left up.

Sounds simple, right? Now for the hard part: implementing the interpreter.

Fifty minutes remain. I definitely will need more time.

Writing the Interpreter

First, we have to erase everything we've already written, which wasn't that much anyway.

Now, let's go from the ground up. First, we have to implement the three basic types: Lists, Numbers, and Strings. They all have to be part of the same struct so we can use them anywhere. We can also implement casts while we are at it.

After spending a few minutes trying to figure out how to do Rust development on Nix (because I switched to NixOS a few days ago, writeup soon), I can get started.

In src/stdtypes.rs:

#[derive(Clone)]
pub struct List(pub LinkedList<Data>);

#[derive(Clone)]
pub struct Number(Rational);

#[derive(Clone, Copy, PartialEq)]
pub enum Types {
    STRING,
    NUMBER,
    LIST
}

#[derive(Clone)]
pub struct Data {
    kind: Types, 
    val_string: String,
    val_number: Number,
    val_list: List,
}

Next, we can implement the basic Actions: add and print. This is the minimum to implement a basic "Hello world" program. We will worry about the rest later.

A snippet of the add action:

impl Data {
    pub fn add(&mut self) {
        if self.get_kind() == Types::LIST {
            // The first value in the list will determine the type
            let first = self.get_list().0.pop_front();
            if first.is_none() {
                panic!("List is not allowed to be null!");
            }
            let mut sum: Data = first.unwrap();
            sum.add();
            while let Some(mut var) = self.get_list().0.pop_front() {
                if var.get_kind() == Types::LIST {
                    var.add();
                }
                sum = sum + &mut var;
            }
            self.kind = sum.kind;
            self.val_number = sum.val_number;
            self.val_string = sum.val_string;
        }
    }
}

impl std::ops::Add<&mut Data> for Data {
    type Output = Data;
    fn add(mut self, other: &mut Data) -> Data {
        if self.kind == Types::LIST && other.kind == Types::LIST {
            self.as_list().val_list.0.append(&mut other.as_list().val_list.0);
            return self;
        } else if self.kind == Types::LIST {
            // The clone is stupid
            // C would let me do this WITHOUT the clone
            // It's stupid because we are adding to the end of a linked list. This is a pointer to
            // an object. I should be allowed to add it to the list.
            self.as_list().val_list.0.push_back(other.clone());
            return self;
        } else if self.kind == Types::STRING || other.kind == Types::STRING {
            self.as_string().val_string += &other.as_string().val_string;
            return self;
        } else {
            self.as_number().val_number.0 += &other.as_number().val_number.0;
            return self;
        }
    }
}

The remainder of the code is left as an exercise for the reader.

I have spent six hours on the basics and the add function alone. I don't think that this time limit is gonna work out.

I am on hour ten of six. It is also time to go to bed. I'll write a bunch of comments so I don't forget everything tomorrow.

ooOOOOoOooO Time Skip oOOOoooooOOoO

Anyway, now that we have a functioning action, we can start making the actual interpreter.

For this interpreter, we will need to keep track of a few things:

The current position of the executor in the file.
The call stack and the positions to jump back to.
The current data in the program so far as a list. Because of how this language is structured, only one (recursive) list can be active at a time.
A map of each Action to the location in the file they can be found.
A map of each variable name to the data it represents.

A struct that can keep track of this information could look something like this:

struct Interpreter<R: Read> {
    reader: BufReader<R>,
    call_stack: Vec<SeekFrom>,
    data_state: Data,
    actions: HashMap<String, SeekFrom>,
    variables: HashMap<String, Data>,
}
impl<R: Read> Interpreter<R> {
    fn new(reader: BufReader<R>) -> Interpreter<R> {
        Interpreter {
            reader,
            call_stack: Vec::new(),
            data_state: Data::new(Types::LIST),
            actions: HashMap::new(),
            variables: HashMap::new(),
        }
    }
}

Now to make the actual interpreter… this might take a while.

I have spent another thirty minutes so far. Currently at ten and a half hours. Maybe aim for 24? That sounds impressive, right?

So, the way that the interpreter will work is that it will start reading the file until it hits a token. The tokens are as follows:

Open string with `
End string with '
Begin comment with ;
End comment with ~
Begin expression with (
End expression with )

All whitespace (including newlines) will be ignored unless part of a string. Therefore, all whitespace are considered delimiters.

These are in order of priority. A string should continue until it hits another end quote, and so on. Let's make a basic interpreter that will do the following:

Read the input character by character
1. If a "begin" comment is found, then read character by character until and "end" comment is found
2. Ignore any whitespace
3. If anything but an open parenthesis is found, terminate the program with "unexpected token"
4. When the first opening parenthesis is found, begin execution

In the process of writing the code, I realized that it would be a lot easier to make everything based on postfix notation (reverse polish notation):

(`hello world' print)

instead of prefix notation

(print `hello world')

That does bring into question how functions will be handled. We can worry about that later. I have a couple of ideas.

The big question is how am I going handle recursion? After distracting myself and reading a random article on RSS, I came to the conclusion that the interpreter should hold a reference to a data object in the interpreter state. This should be done through the call stack. In addition to the previous position, it will also contain the location of the previous data object as a reference. In addition, the interpreter will also keep track of a "working data" state. Easy enough, right? Let's get started.

pub fn execute(&mut self) -> Result<(), io::Error> {
    let mut buf: [u8; 1] = [0];
    loop {
        match self.reader.read_exact(&mut buf) {
            Err(_) => break,
            _ => (),
        };
        match self.call_stack.last_mut() {
            Some(call) => match call.get_state() {
                Some(State::Wait) => {
                    // Waiting for program to begin
                    // Only a comment or starting data can begin
                    if buf[0] == Symbols::CommentStart as u8 {
                        self.call_stack.push(Call::new(State::Comment));
                    } else if buf[0] == Symbols::DataStart as u8 {
                        self.call_stack.push(Call::new(State::Data));
                    } else if WHITESPACE.contains(&buf[0]) {
                        // Ignore whitespace
                    } else {
                        // Unexpected token
                        return Err(Error::new(ErrorKind::InvalidData, format!(
                            "Unexpected token {} at position {}",
                            std::str::from_utf8(&buf).unwrap(),
                            self.reader
                                .seek(SeekFrom::Current(0))
                                .expect("Could not get current position!")
                        )));
                    }
                }
                Some(State::Data) => {
                    // Read data into object until end of object
                    if buf[0] == Symbols::DataEnd as u8
                        && self.call_stack.last_mut().unwrap().get_token().len() == 0
                    {
                        self.call_stack.pop();
                    } else if buf[0] == Symbols::StringStart as u8 {
                        self.call_stack.push(Call::new(State::String));
                    } else if !WHITESPACE.contains(&buf[0]) && buf[0] != Symbols::DataEnd as u8
                    {
                        self.call_stack
                            .last_mut()
                            .unwrap()
                            .get_token()
                            .push_str(std::str::from_utf8(&buf).unwrap());
                    } else {
                        // Now that the item in the data is split, we can choose what to do
                        // with it
                        // There are a few options. First is to execute a function. Second is
                        // to parse the object as a number. Third is to parse the object as a
                        // string.

                        let mut reset = true;
                        match self.call_stack.last_mut().unwrap().get_token().as_str() {
                            "add" => self.call_stack.last_mut().unwrap().get_data().add(), // Add data
                            "print" => write!(
                                self.stdout,
                                "{}",
                                self.call_stack.last_mut().unwrap().get_data()
                            )
                            .unwrap(),
                            _ => reset = false,
                        }

                        if reset {
                            // Reset token
                            let token = self.call_stack.last_mut().unwrap().get_token();
                            *token = "".to_string();
                        }

                        if buf[0] == Symbols::DataEnd as u8 {
                            self.call_stack.pop();
                        }
                    }
                }
                Some(State::String) => {
                    // If the string is over, push it into the data object
                    if buf[0] == Symbols::StringEnd as u8 {
                        let string_stack_object = self.call_stack.pop().unwrap();
                        // We can unwrap here safely because the only way to get to a string is
                        // when Data calls it
                        self.call_stack
                            .last_mut()
                            .unwrap()
                            .get_data()
                            .get_list()
                            .push(string_stack_object.data);
                    } else if let Some(last) = self.call_stack.last_mut() {
                        // Read data into object until end of string
                        last.get_data()
                            .get_string()
                            .push_str(std::str::from_utf8(&buf).unwrap());
                    }
                }
                Some(State::Comment) => {
                    // Ignore everything until we hit the comment end
                    if buf[0] == Symbols::CommentEnd as u8 {
                        self.call_stack.pop();
                    }
                }
                None => todo!(),
            },
            None => todo!(), // Parse stack is empty, catastrophic failure
        };
    }
    // End of file
    // Check if the program still needs to do anything
    match self.call_stack.last_mut().unwrap().get_state() {
        Some(State::Wait) => Ok(()),
        _ => return Err(Error::new(ErrorKind::UnexpectedEof, "Unexpected EOF!".to_string())),
    }
}

This (and the supporting data structures) took me like another eight hours to put together, mostly since Rust got in the way (but good thing it did otherwise hard-to-find errors would have popped up).

But it works.

It does hello world.

(`hello, ' `world
' add print)

There is no string escaping (yet), so the newline has to be a part of the code and not as an escape character.

But it works.

The relief.

We are at 18.5 hours. Time to implement the other basic parts. Next up is numbers. It's not too hard, I just have to patch the token parsing to check for numbers.

let mut reset = true;
match self.call_stack.last_mut().unwrap().get_token().as_str() {
    "add" => self.call_stack.last_mut().unwrap().get_data().add(), // Add data
    "print" => write!(
        self.stdout,
        "{}",
        self.call_stack.last_mut().unwrap().get_data()
    )
    .unwrap(),
    _ => {
        match Rational::from_str_radix(self.call_stack.last_mut().unwrap().get_token().as_str(), 10) {
            Ok(val) => {
                let mut data = Data::new(Types::NUMBER);
                data.set_number(Number(val));
                self.call_stack.last_mut().unwrap().get_data().get_list().push(data);
            },
            Err(_) => {
                reset = false;
            },
        }
    },
}

That took less than ten minutes. I'm making good time here! What's next? Unwraps!

What are unwraps, you ask? Well, in a situation like this:

(1 2 (3 4 add 1) add print)

the language will add together the two numbers then add a list to it. Because this language is very poorly designed, the list will be coerced into a number. That will always return the size of the list. Therefore, this will return 5.

Unwrapping would look something like this:

(1 2 (3 4 add 1)? add print)

First 1 and 2 will be added, then 3 4 will be added, then the contents of that inner list will be appended to the end of the outer list:

(1 2 7 1 add print)

which will act as expected.

if buf[0] == Symbols::Unwrap as u8 {
    let last_val = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    match last_val {
        Some(data) => match data.get_kind() {
            Types::LIST => {
                self.call_stack.last_mut().unwrap().get_data().get_list().append(data);
            }
            _ => {
                self.call_stack.last_mut().unwrap().get_data().get_list().push(data);
            }
        },
        _ => {
            return Err(Error::new(
                ErrorKind::InvalidData,
                format!(
                    "Invalid unwrap at position {}",
                    self.reader
                        .seek(SeekFrom::Current(0))
                        .expect("Could not get current position!")
                ),
            ))
        }
    }
}

As I am writing code demos for the language, I am running into issues with newlines. There are no string escapes! I can't write in \n, I have to put in a newline. Let's fix that.

Some(State::StringEscape) => {
    self.call_stack.pop();
    match buf[0] {
        b'n' => self
            .call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .get_string()
            .push_str("\n"),
        _ => {
            let unescape_buf: [u8; 2] = [Symbols::Escape as u8, buf[0]];
            self.call_stack
                .last_mut()
                .unwrap()
                .get_data()
                .get_string()
                .push_str(std::str::from_utf8(&unescape_buf).unwrap());
        }
    };
}

It should be pretty easy to add in more escape characters later on (like if I want to print a single quote). Another forty-five minutes have passed. I am now at 21 hours. I have three hours left to make this language useful.

Of the keywords I defined earlier, I have done exactly two of them: print and add. Who knew making a programming language would be so difficult? I have comments, strings, numbers, nests, unwraps, demos, tests, and internal error handling, so if I were to stop here, I could consider this a success.

But I'm not satisfied yet. The big things remaining are variables and functions. Those would make this language actually useful.

Remember how I said earlier that reverse polish notation would make functions and variables hard? That is because the interpreter reads the file from left to right. It won't know whether to execute the code or save it for later if it's a function, since it hasn't gotten to that part yet.

Because I've already thrown out all sensibility, my brain went to using more than parentheses. A function could be defined with square brackets. Because why not. The interpreter will read the function into memory as a string, then execute it when the function token is found.

Now I'm thinking, why not just use a string? That can be the function! And then, the language can generate its own code to execute!

I am really going crazy here folks.

Things Get Truly Cursed

Let's implement functions-that-are-strings-but-also-executable. It shouldn't be too difficult.

Here's the plan. When defact is hit, then the last item in the data is popped and used as the name, and the data before that is used as the action. Same deal with defvar:

To load the data into the hashmap (this is alongside print):

"defvar" => {
    let key = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    if key.is_none() {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "No action name found".to_string(),
        ));
    }
    let var = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    if var.is_none() {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "No variable found".to_string(),
        ));
    }
    self.variables
        .insert(key.unwrap().to_string(), var.unwrap());
    println!("{:?}", self.variables);
}

and to retrieve it (this is after the token to number check):

match self
    .variables
    .get(self.call_stack.last_mut().unwrap().get_token())
{
    Some(data) => {
        println!("found {:?}", data);
        match data.get_kind() {
            Types::ACTION => {
                unimplemented!()
            }
            _ => {
                self.call_stack
                    .last_mut()
                    .unwrap()
                    .get_data()
                    .get_list()
                    .push(data.clone());
            reset = true;
            }
        }
    }
    None => (),
}

That was just one more hour spent. Two hours remain.

((1 2 3 4 add) `myvarname' defvar)
(myvarname `\n' print)

((1 2 3 4 5 add) `myvarname' defvar)
(myvarname `\n' print)

This will print 10 then 15.

Now for functions. There are two ways I can do this: functions can return data, or they can only act on the variables. It'll be easier to implement the second one, but the first one would be more useful.

I think I have enough time. Let's do it. I've already implemented the code to save a string as an action (almost the exact same as the variable saving code, just with a few more lines to change the type to action).

Maybe the best (worst) way to do this is to start another interpreter with a shared variable state and to use a Cursor to read the function data.

Types::ACTION => {
    let mut c = Cursor::new(Vec::new());

    c.write_all(data.to_string().as_bytes()).unwrap();
    let mut reader = BufReader::new(c);
    reader.rewind().unwrap();
    let mut action_interp = Interpreter::new(
        reader,
        self.stdout,
        self.variables,
        self.call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .clone(),
    );
    // Delete old data (since it was copied into
    // args for new instance)
    self.call_stack.last_mut().unwrap().get_data().erase();
    match action_interp.execute() {
        Ok(data) => {
            self.call_stack
                .last_mut()
                .unwrap()
                .get_data()
                .get_list()
                .push(data);
            reset = true;
        }
        Err(e) => return Err(e),
    }
}

It has now been two hours. I have hit the deadline.

This is the last commit in the 24 hour deadline.

But since when did I care about deadlines? To make this language useful, there is one last thing that must be added: control flow.

Back to Sanity (Kinda)

Implementing eq, lt, gt, and so on will be pretty easy. Each of these will take n >= 3 arguments. The first is the condition. The second is the output if the condition is true. The third is the output if the condition is false. Of course, to make it easier for myself, the output conditions are also strings that will be executed (who cares about Rust's call stack? I'm on a deadline here!)

Furthermore, these comparisons will be made strictly. Casting will be implemented pretty easily:

"tolist" => {
    self.call_stack.last_mut().unwrap().get_data().as_list();
    let data = self.call_stack.last_mut().unwrap().get_data();
    let mut changeme = data.get_list().pop_back().unwrap();
    changeme.as_list();
    data.get_list().push(changeme);
}
"tostring" => {
    let data = self.call_stack.last_mut().unwrap().get_data();
    let mut changeme = data.get_list().pop_back().unwrap();
    changeme.as_string();
    data.get_list().push(changeme);
}
"tonum" => {
    let data = self.call_stack.last_mut().unwrap().get_data();
    let mut changeme = data.get_list().pop_back().unwrap();
    changeme.as_number();
    data.get_list().push(changeme);
}

We are popping the last value in the list, converting it, then pushing it back.

Converting a list to a number still returns the size of the list, so getting the size of lists and strings is pretty easy (ex: (string tolist tonum)).

Now to finish off what we started: conditionals.

Here's how it's going to work: if the conditional is found, pop the last two items (for the true and false conditions), then check the data. Based on the check, execute the true or false condition.

"eq" | "lt" | "gt" | "lte" | "gte" => {
    let token = self.call_stack.last_mut().unwrap().get_token().clone();

    let data = self.call_stack.last_mut().unwrap().get_data();
    let mut false_exec = data.get_list().pop_back().unwrap();
    let mut true_exec = data.get_list().pop_back().unwrap();
    let mut c = Cursor::new(Vec::new());

    let is_true = match token.as_str() {
        "eq" => data.is_eq(),
        "lt" => data.is_sorted_asc(),
        "lte" => data.is_sorted_asc_eq(),
        "gt" => data.is_sorted_desc(),
        "gte" => data.is_sorted_desc_eq(),
        _ => unreachable!(),
    };

    if is_true {
        c.write_all(true_exec.get_string().as_bytes()).unwrap();
    } else {
        c.write_all(false_exec.get_string().as_bytes()).unwrap();
    }

    let mut reader = BufReader::new(c);
    reader.rewind().unwrap();
    let mut action_interp = Interpreter::new(
        reader,
        self.stdout,
        self.variables,
        self.args.clone(),
    );
    self.call_stack.last_mut().unwrap().get_data().erase();
    match action_interp.execute() {
        Ok(data) => {
            self.call_stack
                .last_mut()
                .unwrap()
                .get_data()
                .get_list()
                .push(data);
            reset = true;
        }
        Err(e) => return Err(e),
    }
}

I have also realized that the syntax for conditionals is ugly. If you want to return a string, then it looks so gross:

(1 2 3 4 `(`this is ascending!\')' `(`this is not ascending :(\')' lt `\n' print)

Instead, let's make it look better:

(1 2 3 4 `this is ascending!' `this is not ascending :(' lt `\n' print)

Simple! Just have the interpreter treat all unknown characters as a string to save! (This code is left as an exercise to the reader (or just look at the source (wow this sentence looks like the programming language)))

We are now on hour 26 of 24. This programming language is more or less done. I could implement more features, but at this point I wanna make cool shit with Cane.

Dogfooding

Let's make a simple loop. Print the numbers from 1 to 100.

(1 `counter' defvar)
(
    `(counter 100 `(counter `\n\\\' print) (1 counter add `counter\\\' defvar) (body)\' `\' lte)' 
    `body' defact
)
(body)

Now that I'm writing code in this language, I can see how using strings is a bad idea. Maybe a slightly different syntax is needed?

How about using brackets to signify that?

(1 `counter' defvar)
(
    [(counter 100 [(counter `\n' print) (1 counter add `counter' defvar) (body)] 0 lte)]
    `body' defact
)
(body)

It will act almost exactly like a string, but it will match the brackets so no escaping is needed. If a bracket is found in a string, nothing will happen.

It will still return a string though.

Furthermore, I think the current model of creating a new interpreter every time recursion happens (which will happen a lot) is stupid and inefficient. I have a little bit of pride in my work. After the bracket bit though.

Maybe it was a mistake to have the call stack and the parse stack merged...

Anyway, two hours later I have this:

Some(State::LazyEval) => {
    if buf[0] == Symbols::StringStart as u8 {
        self.call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .get_list()
            .last_mut()
            .unwrap()
            .get_string()
            .push_str("`");
        self.call_stack.push(Call::new(State::String));
    } else if buf[0] == Symbols::LazyStart as u8 {
        open_count += 1;
        self.call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .get_list()
            .last_mut()
            .unwrap()
            .get_string()
            .push_str("`");
    } else if buf[0] == Symbols::LazyEnd as u8 {
        close_count += 1;
        // We only want to add the end quote if we are at least one layer deep
        if open_count - close_count > 0 {
            self.call_stack
                .last_mut()
                .unwrap()
                .get_data()
                .get_list()
                .last_mut()
                .unwrap()
                .get_string()
                .push_str(&("\\".repeat(open_count - close_count - 1) + "'"));
        }
    } else if open_count == close_count {
        // If the string is over, push it into the data object
        let mut string_stack_object = self.call_stack.pop().unwrap();
        // Merge the string fragments
        string_stack_object.get_data().as_string();
        // We can unwrap here safely because the only way to get to a string is
        // when Data calls it
        self.call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .get_list()
            .push(string_stack_object.data);
    } else {
        let last = self.call_stack.last_mut().unwrap();
        // Read data into object until end of string
        last.get_data()
            .get_string()
            .push_str(from_utf8(&buf).unwrap());
    }
}

Of course, the rest of the code was tweaked to get this working too.

I am now at hour 28 of 24. I feel exhausted. Optimization be damned, I'm gonna get this ready for release. What a ride.

Butttt, I could release this as WASM and have people be able to mess with it on the web… maybe later.

Also, everything is happening on the stack. Nothing is on the heap. This means that recursion (being so inefficient), can really only go so deep. The loop will max out at around ~760 iterations.

Now that I think about it, I did spend the first six hours or so doing nothing and messing around with AST and LLVM.

I am now at hour 22 of 24 >:3

I think that I need to add more functionality:

DATA N repeat will execute DATAn times
DATA... N duplicate will duplicate DATA...n times
DATA... drop will remove the data from the object
DATA... flatten will recursively unwrap the data until no lists remain

This should get rid of the recursion issue with loops!

For example, flatten is implemented like so:

pub fn make_flat(&mut self) {
    let mut flattened: LinkedList<Data> = LinkedList::new();
    if self.kind == Types::LIST {
        while let Some(mut item) = self.get_list().0.pop_front() {
            flattened.append(&mut item.val_list.0);
        }
        self.set_list(List::new(flattened));
    }
}

mmmm recursion. Tasty. Can you tell I'm losing my sanity over this project?

; Prints the numbers from 1 to 100 ~

(1 `count' defvar)
(
    [
        (count `\n' print)
        (count 1 add `count' defvar)
    ]
    1000000 repeat
)

It has been another hour and a half. I should probably add the last two features I have been thinking about: join and split.

DATA... String join will join the data with the string (think python "stringhere".join(array))
String String split will split the first string with the second. Should return a list.

Join was easy:

pub fn join_string(&mut self, joiner: &str) {
    if self.kind != Types::LIST {
        return;
    }
    let mut new_string = String::new();
    let mut should_strip: bool = false;
    for val in self.iter() {
        should_strip = true;
        new_string.push_str(&(val.to_string() + joiner));
    }
    if should_strip {
        new_string = new_string.strip_suffix(joiner).unwrap().to_string();
    }

    self.set_string(new_string);
    self.kind = Types::STRING;
}

along with the interpreter side of things:

"join" => {
    let joiner = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    let data = self.call_stack.last_mut().unwrap().get_data();
    if joiner.is_none() {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Joiner is missing".to_string(),
        ));
    }
    if joiner.as_ref().unwrap().get_kind() != Types::STRING {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Joiner must be a string".to_string(),
        ));
    }
    data.join_string(joiner.unwrap().get_string());
}

Splitting was much the same. This post is already very code-heavy, but here goes anyway:

"split" => {
    let splitter = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    let splitme = self
        .call_stack
        .last_mut()
        .unwrap()
        .get_data()
        .get_list()
        .pop_back();
    if splitter.is_none() {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Splitter is missing".to_string(),
        ));
    }
    if splitter.as_ref().unwrap().get_kind() != Types::STRING {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Splitter must be a string".to_string(),
        ));
    }
    if splitme.is_none() {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Splittee is missing".to_string(),
        ));
    }
    if splitme.as_ref().unwrap().get_kind() != Types::STRING {
        return Err(Error::new(
            ErrorKind::InvalidData,
            "Splittee must be a string".to_string(),
        ));
    }
    for item in splitme
        .unwrap()
        .get_string()
        .split(splitter.unwrap().get_string().as_str())
    {
        let mut data = Data::new(Types::STRING);
        data.set_string(item.to_string());
        self.call_stack
            .last_mut()
            .unwrap()
            .get_data()
            .get_list()
            .push(data);
    }
}

I have now hit 24 hours for the second time. I should be done now.

The Aftermath

The TL;DR is a pretty good summary of how I felt after all of this. But you read the whole thing (or scrolled to the bottom), so I should probably have something else here for you.

Since I would consider this an esolang, I need to do funky esolang things with it, like prove Turing-completeness or write a quine. I'm sure it's possible, but I am not good enough at computers to figure out those just yet. I need a break from doing computer things like this. My brain is mush enough already.

So what did I learn?

Rust. Mostly Rust. As I stopped searching for things on the internet (beyond Rust things) once I gave up on using LLVM, I did not learn much else from this project. But it was fun!

What Next?

Probably nothing. Maybe I'll add in some list manipulation features later, but I've already stretched those 24 hours far enough.

Maybe I'll add support for this on Replit with Nix trickery (once I figure out how to use Nix (which is kinda sad considering that I have been using NixOS for a few months now (wow this sentence is also looking like Cane (since it has a lot of parentheses)))). The possibilities are endless.

Is this language stupid? Yes. Could I have done better with more time? No, since I would have gotten bored or burnt out or a combination of the two. I'm glad I put a time limit on this.

The Part in which I Keep Working

I could not put down the language and now I ported it to WASM. It required a restructure of the crates, libraries, and then swap out the only dependency for another similar one.

So yeah. Cane is now Web-Ready. Like Java.

And then my friend tells me that Cane does not work on Windows! Turns out that Windows has different line endings than *nix, so I just needed to add \r to the list of whitespace.

See, now I want to implement even more features. I wanna add some proper math support. Multiplication, exponentiation, division, subtraction, etc)

sub: subtracts the data one after another. All data must be numbers.
mult: multiplies the data one after another. All data must be numbers.
div: divides the numbers one after another. All data must be numbers.
neg: negates the last number.
floor: rounds down to the nearest integer.
ceil: rounds up to the nearest integer.
abs: absolute value of the last number.

Ok, that took like another hour. Now I'm done, I swear.

But there's no way to read stdin… no no no no no no. I am done. Finished.

Aaaand one hour later I could not hold myself back and decided to implement stdin. And forever loops. The loops were easy. It was implementing stdin that was too difficult. I blame Javascript. I could not, for the life of me, get stdin to work on the web playground properly. So the web playground has a non-interactive stdin. Heartbreaking. Maybe when I have more time and energy I can look into something else, like xterm.js.

See, now it's 8:37 in the morning on Monday. I am getting this blog post out at 9:00, regardless of whether it's finished or not.

I think it's finished.

Grand total: 6 hours (messing around with old stuff), 25 hours (language), 4 hours (web playground), 1 hour (blog post).

Hmm, let's figure out how much that is:

((6 25 4 1) `data' defvar)
((data ? ` + ' join) ` = ' (data ? add) println)

6 + 25 + 4 + 1 = 36

Isn't this nice? A programming language in thirty-six hours (but also actually just 25 hours for the actual language). I think this is a 24-hour-ish project, right?

Right?

Corrections? Comments? Just wanna say hi? Get in touch!

Follow me on Mastodon

Building a Programming Language in Twenty-Four Hours

Building a Programming Language in Twenty-Four Hours

Introducing the Cane Programming Language

The TL;DR

Development Hell

Introduction to ASTs

Introduction to Grammar

Designing the New Language

Designing The Interpreter

Writing the Interpreter

Things Get Truly Cursed

Back to Sanity (Kinda)

Dogfooding

The Aftermath

So what did I learn?

What Next?

The Part in which I Keep Working

Recommend

特斯拉翻车后改换门庭？网友猜测小鹏G6新代言人是林志颖

B端系统设计5要素，一学就会！

Apple announces Mac Pro with M2 Ultra, starting at $6,999

Paragraphica - A camera that takes photos using location data | Product Hunt

Apple’s new 15-inch MacBook Air is the ‘world’s thinnest’

组件详解｜级联选择、树形选框、穿梭框，用法有什么区别？

Framer Study Guide

Create and Modify PDF Files in Python

JupyterLab 4.0 is Here

The MacBook Air gets bigger with new 15-inch model

About Joyk