Another tutorial for writing a Forth interpreter in assembly (Part 2)

Part 1 of this tutorial

[Author's note: When I started writing these posts, I thought that surely two parts will be enough. But it wasn't. Maybe I'll come back later and write a version assuming basic assembly is known.]

With the major part of Forth written in Forth, we can now them in "Forth-style" assembly (using call/ret). We will still have to write all the primitive functions in assembly.

Assembly

We'll write a Forth to assembly compiler focused on only translating forth.f (and not Forth in general) to assembly.

compiler11.py contains a first version. Try to run in with an empty "main" function (provided in header11.asm).

python3 compiler11.py forth11.f | cat header11.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

in fish or

python3 compiler11.py forth11.f | cat header.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o && ld forth.o && ./a.out

in bash. (Replace ;and with && to run future commands in bash.) This gives a number of errors for the moment.

Implementation detail: This tutorial uses Intel syntax assembly and is tested on nasm, the Netwide Assembler.

(Instead of cat, we could use the include feature of nasm with %include "header.asm".)

After getting the basics of this compiler working, we'll more or less follow the steps for building the Python simulator by adding main_loop, input_loop and write_loop one by one.

Renaming

Some of our Forth function names are not valid in assembly. Let's change them.

Try the same thing with compiler12.py.

python3 compiler12.py forth11.f | cat header11.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

We should only see "symbol 'something' undefined" errors now instead of the mixed bag we had before. Before implementing these missing functions, we need to discuss how assembly works.

Assembly computation model

(This description isn't nearly as clean or complete as the one for Forth since we won't implement our own assembler.)

Assembly has

a fixed number of temporary variables (called "registers"),
a stack,
memory, an array of bits, and
an implicit hard-coded lookup table of primitives (called "instructions").

Memory holds both instructions and stored data. This is actually also the case for Forth but this tutorial will stop before getting to memory allocation. It is an array of single bits subdivided into smaller parts, each containing a fixed number of bits (such as 8 bits, 16 bits, 32 bits, etc). Registers hold a fixed number of bits (64 bits on x86_64).

Memory is first split into "section" that are either initializable to hard-coded constants or writable but not both. All our variables will be stored in the writable section.

Assembly uses a single stack which contains all elements that would normally be in both the call and data stack.

For this tutorial, we will pretend there are 16 of them with 4 reserved for the assembly stack location and system calls.

In the main loop for running assembled code, one primitive is read from memory (at the location indicated by register #6 (rsp) and is executed. Advance rsp to the next location and repeat.

Assembly has lots and lots of primitives ("assembly instructions") with hard coded parameters. We will only use a subset. The vast majority are named using 3 or 4 characters, from a time where source code size mattered and memory was scarce.

mov (assignment =)
add, sub, shl, shr (various augmented assignments: +=, -=, <<=, >>=)
inc, dec (+= 1 and -= 1)
cmp, test (<, >, ==, etc. However, the results are stored in "flags" instead of booleans)
push, pop (move data from the (only) stack to registers and back)
labels (such as some_label: marks location in memory to go to later)
jmp (goto a memory location)
call, ret (call a function and return from a function)
syscall (system calls)
lea (assignment to compute address value; this function exists as a consequence of notation used for other operations)

call is shorthand for pushing the current location onto the stack and going to it. ret pops the top of the stack expecting a memory location to return to and jumps to it. We have to be careful not to put something else at the top of the stack when calling ret.

A few more assembly instructions will be introduced as we need them.

More information about syntax and assembly in general is in the appendix.

Implementation choices

We make the following non-canonical decisions.

All integers are 64 bits.
All registers are named r1, r2, ..., r16. (r1=rax, r2=rcx, r3=rdx, r4=rbx)
All arrays are arrays of 64 bit integers (that can increase its size by append elements at the end), except for strings which will be stored in arrays of 8 bit integers.
memory_length is an integer containing the number of elements in memory_stack. The location memory_stack + 8*i contains the element at index i.
All stacks are implemented using arrays using similar naming conventions.
Whenever possible, we'll store array indices (instead of memory addresses).

All these arrays (and more) are declared in the writable .bss section of our assembly source.

We'll wrap comparisons and other tests so they push a boolean onto the data stack.

Data stack

We frequently push and pop from the data stack and we want to be able to push to and from registers. So we will copy-paste the same few lines when we push/pop from the data stack. We use assembly macros to avoid rewriting these instructions each time. They are named pushdata and popdata. We reserve rbp=r5 for these functions (so pushdata r5 and popdata r5 will behave incorrectly) but this register can otherwise be used in our program (although we won't).

Now we can write things like pushdata r1 (to push the value in register r1) and popdata r2 (to pop the data stack and r2 to the popped value).

Output

Output is more complicated in assembly. One way is to link the C library and call its printf function. For this tutorial, we will write our own print function from scratch using system calls.

To print, make a system call with four registers set:

eax=1 (meaning we want the "write" system call),
rdi=1 (meaning write to stdout),
rsi= address of the first character to print,
rdx= length of string to print.

(For us, rax is r1 and rdx is r3.)

We add the prints and print functions (for printing a hard-coded string and an integer at the top of the data stack). Also add printeol, printspace for printing specific single characters. The last two are not safe and will overwrite registers.

Let's stop here and test out our data pushing/popping and printing functions. Run forth13.asm. Running

nasm -f elf64 forth13.asm -o a.o ;and ld a.o ;and ./a.out

should give

Hello world
1000 in hex and reversed is: 8e300000 
0d700000 8bb00000

Easier debugging

Its possible to try and use gdb or ddd for debugging but I found them to behave unexpectedly too frequently when not used for debugging C code. They also need familiarity with lots of the tools' design decisions to get started properly.

Try this link if you want to give it a go.

Otherwise, we are stuck with debugging with print statements. Our program is small enough for it to be doable. This tutorial was made using only print statement debugging :| .

Main loop

As with our Python simulator, we'll first add enough to make main-loop run. (And then the other two loops although we'll only make part of input-loop work at first before moving to write-loop.)

Early binding optimization

This is the only optimization done in this tutorial.

Since string manipulation is more difficult in assembly, we change the content of memory from strings containing function names to the following .

For a non-primitive function, instead of its name as a string, we will write the functions' starting index in memory.
For a primitive function, we will write the functions' index in a (constant) primitive_function array.

For non-primitive functions, we are essentially calling names.get when writing to memory instead of when reading from it.

Remark: This changes the behaviour of function redefinition (function re-binding) from "late binding" to "early binding". That is, the old version of a function that's redefined will still be called by functions written to memory before its redefinition.

Compiled Forth

We'll use jmp ("jump" or "goto") instead of repeat. This also eliminates the need for call. To do this in Forth, we add label: and goto: functions, the first to add a label to the current location in the function body and the second to go to a named label. We will not implement them in our Forth interpreter so.

Call this version of Forth "compiled Forth" to distinguish it from "interpreted Forth" that our simulation and current interpreter will be able to run. Although we could also update our simulation with these extra instructions for easier debugging. We won't do that in this tutorial.

Alternative: Alternatively, we could implement repeat as in our Python implementation of repeat and change the call stack. This involves manually modifying the return address at the top of the assembly stack (and returning to the new modified address. call is trickier at this point since we don't have a names map in assembly yet.

Main loop

With main-loop rewritten in compiled Forth in forth.f, we can add all missing primitives: call_stack.len, call_stack.push, call_stack.pop, memory.get, if_else, equal, add, call_primitive, push1. We add them to header.asm.

We haven't decided what is a primitive yet (in the Python simulator, we tested this with callable) so is_primitive is hard coded to always return True for the moment.

Try running the files in the forth14 directory.

python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

which runs the hard-coded Forth program

push1 print push1 print return

A pregenerated forth.asm is available for all steps from this point on.

Choosing between compiled and hand-written assembly

We can now choose to either extend compiler.py to generate the entire assembly file or we can copy and paste its output to a .asm file we edit by hand.

Both approaches are equally good.

We pick the first approach but this guide should be just as good if the second one is chosen. Some functions will still have to be written entirely in assembly (to be either included or copied by compiler.py).

We will also have the choice of using compiler.py or nasm's macros as a preporcessor. We do a mix of the two for no particular reason. A different choice should also work just as well.

I find it a bit confusing to have preprocessor and assembly instructions mixed into the same file so I will put all preprocessor directives at the beginning.

Next token from input

Following the construction of our Python simulator, we want to get input-loop working but we first need to add functionality that we took for granted in Python. In fact, we aim to get only next-input working for now, which we write in compiled Forth which returns the next token. It works in 3 steps.

Read until we encounter a non-space.
Read until we encounter a space.
Write everything from 2 to a string.

See forth.f. We replaced any break by a jmp to the end of the loop. We now add the primitives to get next-input working.

Reading from input

To be able to run next-input, we'll need read_stdin that read from stdin and writes to input_buffer. The read system call needs four registers set:

eax=1 (the "read" system call),
rdi=1 (meaning stdin),
rsi= address to write the read input to,
rdx= max number of characters to read.

The call sets r1 (=rax) to the number of characters read. Write and test read_stdin separately first. See input.asm.

nasm -f elf64 input.asm -o test.o ;and ld test.o ;and ./a.out

Add the primitives next_char and is_input_end to process the content of the buffer (and in the case of next_char, also calls next-input if the buffer runs out).

Optionally test these primitives too. (Printing a character on the data stack will print its ASCII value.)

Common array methods

Add the strings array (which will contain all strings) and methods for it.

Instead of writing new methods for every array we add, we use compiler.py as a more flexible preprocessor (than nasm) and automatically generate array methods. We also move the .bss section containing array declarations from header.asm to compiler.py.

Alternative: Alternatively, we could write generic array method that takes an array as first argument (like a Python class method). In that case, we'd need some way to get its length (possibly by always putting the length at index -1 of the array).

We'll add some extra array functions that we don't need yet: array.set_at and array.address.

Next token from input

Finally, we add the primitives is_whitespace, if, not and sub.

Noticing that equal, is_input_end and not have common function bodies, we define a new nasm macro retbool for comparing and pushing a boolean onto the data stack.

To debug and visualize the results, add the print_string function which interprets the top of the stack as an address to a string and prints it.

next-input is ready! Try it in forth15.

python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

The main _start function runs

next-input strings.address print_string printeol

twice and should print the first two tokens of the input on separate lines (after prompting for input).

Write loop

As a reminder, input_loop will come at the very end. We'll now work towards getting write_loop working.

Names lookup

We'll first need a names dictionnary to decide what to write from input in the interpreter. Since assembly doesn't come with dictionnaries, we will write our own.

Our dictionnary will be very simple:

a names.keys array and
a names.values array of the same length.

We use a linear search for keys (instead of logarithmic): given a key, find its index in names.keys and return names.values at that index. In Forth, these functions can be written as:

[ names.keys.index names.values.get ] bind: names.get
[ names.keys.len ] bind: names.len
[ names.keys.append names.values.append ] bind: names.set

We don't even delete old keys (and values) and instead implement index to find the first occurence of the key from the back of the array.

This is implemented in forth.f. Try it in forth16.

python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

It sets names to {23: 0x100, 36: 0x200} and then looks up key 23.

However, this implementation only allows keys that are (64-bit) integers or characters and our keys will be strings. So we need to replace equal with some kind of string_equal in names.keys.index.

String comparison

To compare two strings, we first compare their length l and then the l following characters for equality.

Assembly instruction: repe cmpsb compares the first r2 entries of two arrays of bytes: one starting at address rsi and one starting at address rdi.

"Sets flags to true" if these r2 entries are equal and "sets flags to false" otherwise (I put quotes because it sets some flags so that je, jne, etc will work as expected afterwards).

We implement strings_equals using repe cmpsb in header.asm and change names.keys.index to use it. Try it in forth17.

python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

Its _start function runs

next-input strings.address next-input strings.address string_equal print

which compares the two first input tokens and prints whether they are equal.

Implementation detail: string_equal takes two addresses as parameters which goes against our convention of passing indices when possible. If we're sure all strings will come from the strings array then we could instead pass indices. But this makes some testing and debugging harder if we want to use some hard coded strings (they'd have to be copied to the strings array first).

Write loop

With all ingredients needed, we can now add write-loop to forth.f.

Change names.keys.index to assume keys are strings (by replacing equal with strings.address string_equal).

Try it in forth18.

python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out

A sample run:

Enter 3 tokens that will be mapped to 0x100, 0x200 and 0x300:
> foo bar baz
Enter commands to write. The last token must be ']'
> foo foo bar bar baz bar foo baz ]
Writing 00100000 
Writing 00100000 
Writing 00200000 
Writing 00200000 
Writing 00300000 
Writing 00200000 
Writing 00100000 
Writing 00300000

Implementation detail: We added an equal_right_bracket function even though we could have called string_equal with ']' in its place.

Next time

In the next part, we'll add input-loop, complete the interpreter and then add some missing primitives to get a useful interpreter.

Appendix: Full primitives list

A list of primitives in the final version, arranged by type.

Three main loops (generated from Forth): main_loop, input_loop, write_loop
System calls: exit_error, exit0, read_stdin
I/O: next_char, next_input, next_command, print, print_string
Strings: string_equal
Binary: add, sub, not, equal
Control flow: return, if, call_stack.pop_many
Stack shuffling s11, s21, s2, s1212
Call stack, memory, strings and names methods: call_stack.is_empty, call_stack.pop, call_stack.push, memory.get, memory.set_at, memory.len, memory.append, strings.len, strings.set_at, strings.append, strings.pop, names.get, names.len, names.keys.get, names.values.get, names.set, strings.get, strings.address
Internal (could be hidden): setup
Auxiliary (could be hidden, some generated from Forth): exit_main_loop, primitive_case, names_found, names_not_found, space_loop, word_loop, next_char_is_space, is_input_end
Auxiliary (could be hidden): is_primitive, call_primitive, is_whitespace, raw_write_next_input

Appendix: Assembly

Syntax

Intel syntax for assembly instructions mostly follows the

instruction_name param1, param2, ...

pattern. Usually, there are at most two parameters. In Intel syntax the parameters are usually in the same left-to-right order as in Python

mov r1, r2 means r1 = r2
add r1, r2 means r1 += r2

Some instructions allow "computed addresses" as arguments such as [some_array] or [some_array + 8*r1] which we can think of as some_array[0] and some_array[r1] in Python. 8 is the number of bytes (=8 bits) used by each entry in the array. All our arrays entries are 8 bytes (= 8*8 = 64 bits) except for strings which are 1 byte (= 8 bits) arrays.

However, most instructions will not allow both arguments to be computed addresses so some register has to be used as temporary variable.

The lea instructions stores the computed address itself (instead of the value at that address in memory). lea r1, [some_array] stores the address of some_array in r1.

Registers bit length

A fixed number of variables and re-use means we have to pick some convention or else functions will keep overwriting each other's calls. Saving values that we don't want to lose is done by writing to memory or pushing to the stack.

Memory in assembly really is treated as an array of bits with many convenience functions/syntactic sugar for manipulation 64 consecutive bits at a time. In fact, there are some features for manipulating 8, 16 and 32 consecutive bits at a time as well. The register naming pattern is

al (8 bits), ax (16 bits), eax (32 bits), rax (64 bits)
bl (8 bits), bx (16 bits), ebx (32 bits), rbx (64 bits)

al and rax, say, are not separate variables. al is the last 8 bits of rax! So normally, only use one or the other at a time.

Furthermore, some convention was adopted about the "b" registers and not the other three "a", "c" and "d" (about who should save its value for calls to avoid being over-written).

This tutorial uses register names r1, r2, ..., r16 where r1=rax, r2=rcx, r3=rdx, r4=rbx, r5=rbp.

Assigning (moving) things of different bit length (length internally tracked by the programmer) is usually done by padding values with zeros using instructions like movzx ("Move zero extend"). Assigning/copying bits from memory requires indicating the bit length explicitly mov qword r1, [some_array + 8*r2].

The bit length indicators are qword (64 bits), dword (32 bits), word (16 bits), byte (8 bits).

Flags

When comparing, some flags are set instead of returning booleans. Then instruction for branching (jne, je, jz, jns, etc) behave differently depending on which flags are set. This allows making some programs very slightly faster because inc, dec and a few other non-comparison operations also set these flags as a side effect. This avoids the need for an extra comparison (at the cost of having to track these side effects in your head).

Labels, local labels

For jumping (goto) instructions, places in memory can be marked with labels. A local label (as opposed to a global label) starts with a dot (.) has scope between the last global label and next global label's declaration. So jumps inside a global labels "block" goes to a local label in the same block.

Labels are otherwise unique. There is no further nesting of scope (no "local local labels").

Glossary

location: Integer index, usually in the memory array (but possibly elsewhere like the strings array). Sometimes called pointers. function name: A string naming a function. This string can contain any character that's not a whitespace. They are call words in Forth.

Blog index

Another tutorial for writing a Forth interpreter in assembly (Part 2)

Assembly

Renaming

Assembly computation model

Implementation choices

Data stack

Output

Easier debugging

Main loop

Early binding optimization

Compiled Forth

Main loop

Choosing between compiled and hand-written assembly

Next token from input

Reading from input

Common array methods

Next token from input

Write loop

Names lookup

String comparison

Write loop

Next time

Appendix: Full primitives list

Appendix: Assembly

Syntax

Registers bit length

Flags

Labels, local labels

Related links

Glossary