Build a Programming Language, Part III

Welcome back to our continuing series on creating your own programming language!

Where we left off last time, we described a really simple language we called PangolinV1 and talked about some of the design decisions that went into it, as well as provided interpreters to play around with.

The theme of this installment is going to be data. The first version of Pangolin only had numbers and functions. Technically speaking that’s all you need for a full programming language, at least according to logician Kurt Goedel (see links below about Goedel numberings!), but that’s a pretty silly way to write code.

In PangolinV2 we’re going to add a few different kinds of data:

booleans
arrays / lists
characters
strings

And we’re going to talk about adding types to Pangolin as well!

So in PangolinV1 we had if-statements and while-loops, but we just used numbers to mimic the boolean values of “true” and “false”. 0 or less was “false” and 1 or more was “true”. Now, that’s a design choice with a pretty long history. For example, C uses numbers for making choices. Meanwhile, in Python and JavaScript numbers are technically distinct from booleans but 1 equals true in both languages, and you can use numbers in place of actual booleans.

On the other hand, here’s a problem that happens silently in both JavaScript and C:

if (x = 10) {
    ...
}
else {
    ...
}

Here you probably meant to test if x == 10; instead you just set x to 10, which then returns 10, which will always be interpreted as true. That’s not right at all!

Having a separate boolean type helps prevent mistakes like this! It comes with a price, though, because it makes our implementation more complicated.

That’s actually the running theme of language implementation: when you make things better for programmers you’re probably making it harder for yourself. That’s not necessarily a bad thing, though, because it keeps making languages from being boring!

So in PangolinV2 we’re going to have true and false as separate keywords. That means we need to change our interpreter to reject numbers used in place of true or false and rewrite our if-statement, while-loop, <, and == code to expect booleans instead of numbers. We hit another design choice, though, which is do we want a way to convert numbers to booleans. This is called “type coercion” and happens in languages like JavaScript even without the programmer deciding it should, leading to some very odd things as you’ll see in one of the links below.

My opinion is that type coercion is a good option to give a programmer, but they have to use it consciously and not have it happen as a surprise. So we’re going to introduce the operator (toBool n) that takes the number n and turns it into true if n > 0, and turns it into false if n < 0.

What do you think? Should there also be a way to turn booleans into numbers? For your own language, do you just want to let numbers be used to determine what’s true and what’s false? These are choices you get to make, and you get to form your own opinions!

The next piece of data we’re adding to Pangolin is lists. Now, lists or arrays exist in some form in basically every programming language. We use them to collect data together in an easy to process form. In JavaScript they’re called arrays, while in Python they’re called lists, though they both work basically the same way: collections of data stored in order whose size is allowed to change over the course of the program.

In other languages, though, arrays and lists are different things. In C or Java, for example, arrays are fixed in size. When you make it you need to declare how big it is going to be in advance, like in C or C++

int n[10] // an array of 10 integers

or in Java

int[] n = new int[10]

and with arrays like these you can’t easily change the number of “slots” for data like you can with JavaScript arrays.

Why on Earth, you might wonder, would you want to have arrays that are more limited and inconvenient? Well, fundamentally it’s about efficiency. In a language like C or C++, you want the code to run fast and use as little memory as possible. So in C if you make an array of 10 integers, you’ll be using 320 bits of data and nothing more. That kind of precision in memory usage is important if you’re trying to program a device like an Arduino or write an operating system, but for a lot of programming it just doesn’t matter. If you want your language to be eventually used for programming hardware like C is, then you’ll probably want to add fixed length arrays into your language.

Also, why does Java, which is a programming language very unsuited to programming microcontrollers or operating systems, have fixed size arrays? I think it’s ultimately because the designers of Java were trying to not frighten the C/C++ programmers they were attempting to recruit.

In Pangolin V2 we’re going to have lists like Python or JavaScript. You make a new list one of two ways: either by starting with nil and using the :: operator to append new elements to the list, or using the list built-in function with a series of arguments.

So to make a list out of nothing, in PagolinV2 we’d write

(:: 1 (:: 2 (:: 3 nil)))

(list 1 2 3)

Both of these functions return a list, (1 2 3), so :: in Pangolin is more like Python’s + for concatenating lists than it is like Python’s append function.

You can access elements of lists with the function nth, whose name I’ve stolen from the language Common Lisp, and you can change elements of lists like so:

(begin
 (var x (list 1 2 3))
 (set (nth 0 x) 100))

which changes the 0th, or “first”, element of the list to 100, so now the list x becomes (100 2 3)

Here’s a weird question, though, that you need to consider: what’s the value of the list x at the end of this Pangolin program?

(begin
 (var x (list 1 2 3))
 (var y (:: 0 x))
 (set (nth 1 y) 20))

Should x be the list (20 2 3) or should it be (1 2 3)?

Sounds innocuous, right? But it gets at a really important question: as you grow the list, are you making new lists or adding onto one single list, or in other words are you sharing structure between the two lists? There’s advantages and disadvantages to both. The biggest advantage to sharing structure is that it makes your code more efficient. Copying lists takes both time and memory. The biggest disadvantage of sharing structure is that you might get unexpected things happening if you’re not paying attention.

In Pangolin, unlike its inspiration in Common Lisp, we’re not going to share structure. Whenever you add to a list with the :: operator you’ll get back a new list that’s a copy of the old one with the new data added on.

The only other issue with lists is that we also need to add a length function as well in order to allow you to loop through your list.

Finally, we need to talk about characters and strings. So in Pangolin the syntax for a character will just be a character in single quotes like ‘a’ or ‘<'. We'll add this as its own data type just like we've done with numbers and booleans so far. The question, though, is how to handle strings? An approach I personally like is to just let strings be lists of characters. The only complication really comes in parsing and printing: if you want to let programmers use the normal quotation marks to represent strings you'll need to change your interpreter to recognize strings, and when you print out a list of characters you’ll need to put the quotation marks back in. As always, there's going to be interpreters for you to play around with and build off of in the repository https://github.com/clarissalittler/creating-languages.

So if you’ve been following along or are just getting started with this series, think about the questions I’ve asked in the article and about the choices I’ve made. Pangolin is meant to just be an example of my choices based on my preferences. Ultimately, you should make the language you want to program in.

Learn More

Learn More

Type coercion in JavaScript

How type coercion gets weird

Goedel Numberings

Beej’s Guide to C Programming

Arrays in C

Pointers in C

The book on C Programming

A novice friendly book on Common Lisp