[Author's note: When I started writing these posts, I thought that surely two parts will be enough. But it wasn't. Maybe I'll come back later and write a version assuming basic assembly is known.]
With the major part of Forth written in Forth, we can now them in "Forth-style" assembly (using call
/ret
). We will still have to write all the primitive functions in assembly.
We'll write a Forth to assembly compiler focused on only translating forth.f
(and not Forth in general) to assembly.
compiler11.py contains a first version. Try to run in with an empty "main" function (provided in header11.asm).
python3 compiler11.py forth11.f | cat header11.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
in fish or
python3 compiler11.py forth11.f | cat header.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o && ld forth.o && ./a.out
in bash. (Replace ;and
with &&
to run future commands in bash.) This gives a number of errors for the moment.
Implementation detail: This tutorial uses Intel syntax assembly and is tested on nasm, the Netwide Assembler.
(Instead of cat
, we could use the include
feature of nasm with %include "header.asm"
.)
After getting the basics of this compiler working, we'll more or less follow the steps for building the Python simulator by adding main_loop
, input_loop
and write_loop
one by one.
Some of our Forth function names are not valid in assembly. Let's change them.
Try the same thing with compiler12.py.
python3 compiler12.py forth11.f | cat header11.asm - > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
We should only see "symbol 'something' undefined" errors now instead of the mixed bag we had before. Before implementing these missing functions, we need to discuss how assembly works.
(This description isn't nearly as clean or complete as the one for Forth since we won't implement our own assembler.)
Assembly has
Memory holds both instructions and stored data. This is actually also the case for Forth but this tutorial will stop before getting to memory allocation. It is an array of single bits subdivided into smaller parts, each containing a fixed number of bits (such as 8 bits, 16 bits, 32 bits, etc). Registers hold a fixed number of bits (64 bits on x86_64).
Memory is first split into "section" that are either initializable to hard-coded constants or writable but not both. All our variables will be stored in the writable section.
Assembly uses a single stack which contains all elements that would normally be in both the call and data stack.
For this tutorial, we will pretend there are 16 of them with 4 reserved for the assembly stack location and system calls.
In the main loop for running assembled code, one primitive is read from memory (at the location indicated by register #6 (rsp
) and is executed. Advance rsp
to the next location and repeat.
Assembly has lots and lots of primitives ("assembly instructions") with hard coded parameters. We will only use a subset. The vast majority are named using 3 or 4 characters, from a time where source code size mattered and memory was scarce.
mov
(assignment =
)add
, sub
, shl
, shr
(various augmented assignments: +=
, -=
, <<=
, >>=
)inc
, dec
(+= 1
and -= 1
)cmp
, test
(<
, >
, ==
, etc. However, the results are stored in "flags" instead of booleans)push
, pop
(move data from the (only) stack to registers and back)some_label:
marks location in memory to go to later)jmp
(goto a memory location)call
, ret
(call a function and return from a function)syscall
(system calls)lea
(assignment to compute address value; this function exists as a consequence of notation used for other operations)call
is shorthand for pushing the current location onto the stack and going to it. ret
pops the top of the stack expecting a memory location to return to and jumps to it. We have to be careful not to put something else at the top of the stack when calling ret
.
A few more assembly instructions will be introduced as we need them.
More information about syntax and assembly in general is in the appendix.
We make the following non-canonical decisions.
r1, r2, ..., r16
. (r1=rax, r2=rcx, r3=rdx, r4=rbx
)memory_length
is an integer containing the number of elements in memory_stack
. The location memory_stack + 8*i
contains the element at index i
.All these arrays (and more) are declared in the writable .bss
section of our assembly source.
We'll wrap comparisons and other tests so they push a boolean onto the data stack.
We frequently push and pop from the data stack and we want to be able to push to and from registers. So we will copy-paste the same few lines when we push/pop from the data stack. We use assembly macros to avoid rewriting these instructions each time. They are named pushdata
and popdata
. We reserve rbp
=r5
for these functions (so pushdata r5
and popdata r5
will behave incorrectly) but this register can otherwise be used in our program (although we won't).
Now we can write things like pushdata r1
(to push the value in register r1
) and popdata r2
(to pop the data stack and r2
to the popped value).
Output is more complicated in assembly. One way is to link the C library and call its printf
function. For this tutorial, we will write our own print function from scratch using system calls.
To print, make a system call with four registers set:
eax=1
(meaning we want the "write" system call),rdi=1
(meaning write to stdout),rsi=
address of the first character to print,rdx=
length of string to print.(For us, rax
is r1
and rdx
is r3
.)
We add the prints
and print
functions (for printing a hard-coded string and an integer at the top of the data stack). Also add printeol
, printspace
for printing specific single characters. The last two are not safe and will overwrite registers.
Let's stop here and test out our data pushing/popping and printing functions. Run forth13.asm. Running
nasm -f elf64 forth13.asm -o a.o ;and ld a.o ;and ./a.out
should give
Hello world
1000 in hex and reversed is: 8e300000
0d700000 8bb00000
Its possible to try and use gdb or ddd for debugging but I found them to behave unexpectedly too frequently when not used for debugging C code. They also need familiarity with lots of the tools' design decisions to get started properly.
Try this link if you want to give it a go.
Otherwise, we are stuck with debugging with print statements. Our program is small enough for it to be doable. This tutorial was made using only print statement debugging :|
.
As with our Python simulator, we'll first add enough to make main-loop
run. (And then the other two loops although we'll only make part of input-loop
work at first before moving to write-loop
.)
This is the only optimization done in this tutorial.
Since string manipulation is more difficult in assembly, we change the content of memory
from strings containing function names to the following .
memory
.primitive_function
array.For non-primitive functions, we are essentially calling names.get
when writing to memory instead of when reading from it.
Remark: This changes the behaviour of function redefinition (function re-binding) from "late binding" to "early binding". That is, the old version of a function that's redefined will still be called by functions written to memory before its redefinition.
We'll use jmp
("jump" or "goto") instead of repeat
. This also eliminates the need for call
. To do this in Forth, we add label:
and goto:
functions, the first to add a label to the current location in the function body and the second to go to a named label. We will not implement them in our Forth interpreter so.
Call this version of Forth "compiled Forth" to distinguish it from "interpreted Forth" that our simulation and current interpreter will be able to run. Although we could also update our simulation with these extra instructions for easier debugging. We won't do that in this tutorial.
Alternative: Alternatively, we could implement repeat
as in our Python implementation of repeat
and change the call stack. This involves manually modifying the return address at the top of the assembly stack (and returning to the new modified address. call
is trickier at this point since we don't have a names
map in assembly yet.
With main-loop
rewritten in compiled Forth in forth.f, we can add all missing primitives: call_stack.len, call_stack.push, call_stack.pop, memory.get, if_else, equal, add, call_primitive, push1
. We add them to header.asm.
We haven't decided what is a primitive yet (in the Python simulator, we tested this with callable
) so is_primitive
is hard coded to always return True for the moment.
Try running the files in the forth14 directory.
python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
which runs the hard-coded Forth program
push1 print push1 print return
A pregenerated forth.asm is available for all steps from this point on.
We can now choose to either extend compiler.py
to generate the entire assembly file or we can copy and paste its output to a .asm
file we edit by hand.
Both approaches are equally good.
We pick the first approach but this guide should be just as good if the second one is chosen. Some functions will still have to be written entirely in assembly (to be either include
d or copied by compiler.py
).
We will also have the choice of using compiler.py
or nasm's macros as a preporcessor. We do a mix of the two for no particular reason. A different choice should also work just as well.
I find it a bit confusing to have preprocessor and assembly instructions mixed into the same file so I will put all preprocessor directives at the beginning.
Following the construction of our Python simulator, we want to get input-loop
working but we first need to add functionality that we took for granted in Python. In fact, we aim to get only next-input
working for now, which we write in compiled Forth which returns the next token. It works in 3 steps.
See forth.f. We replaced any break
by a jmp
to the end of the loop. We now add the primitives to get next-input
working.
To be able to run next-input
, we'll need read_stdin
that read from stdin and writes to input_buffer
. The read system call needs four registers set:
eax=1
(the "read" system call),rdi=1
(meaning stdin),rsi=
address to write the read input to,rdx=
max number of characters to read.The call sets r1
(=rax
) to the number of characters read. Write and test read_stdin
separately first. See input.asm.
nasm -f elf64 input.asm -o test.o ;and ld test.o ;and ./a.out
Add the primitives next_char
and is_input_end
to process the content of the buffer (and in the case of next_char
, also calls next-input
if the buffer runs out).
Optionally test these primitives too. (Printing a character on the data stack will print its ASCII value.)
Add the strings
array (which will contain all strings
) and methods for it.
Instead of writing new methods for every array we add, we use compiler.py as a more flexible preprocessor (than nasm) and automatically generate array methods. We also move the .bss
section containing array declarations from header.asm
to compiler.py.
Alternative: Alternatively, we could write generic array method that takes an array as first argument (like a Python class method). In that case, we'd need some way to get its length (possibly by always putting the length at index -1 of the array).
We'll add some extra array functions that we don't need yet: array.set_at
and array.address
.
Finally, we add the primitives is_whitespace
, if
, not
and sub
.
Noticing that equal
, is_input_end
and not
have common function bodies, we define a new nasm
macro retbool
for comparing and pushing a boolean onto the data stack.
To debug and visualize the results, add the print_string
function which interprets the top of the stack as an address to a string and prints it.
next-input
is ready! Try it in forth15.
python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
The main _start
function runs
next-input strings.address print_string printeol
twice and should print the first two tokens of the input on separate lines (after prompting for input).
As a reminder, input_loop
will come at the very end. We'll now work towards getting write_loop
working.
We'll first need a names
dictionnary to decide what to write from input in the interpreter. Since assembly doesn't come with dictionnaries, we will write our own.
Our dictionnary will be very simple:
names.keys
array andnames.values
array of the same length.We use a linear search for keys (instead of logarithmic): given a key, find its index in names.keys
and return names.values
at that index. In Forth, these functions can be written as:
[ names.keys.index names.values.get ] bind: names.get
[ names.keys.len ] bind: names.len
[ names.keys.append names.values.append ] bind: names.set
We don't even delete old keys (and values) and instead implement index to find the first occurence of the key from the back of the array.
This is implemented in forth.f. Try it in forth16.
python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
It sets names
to {23: 0x100, 36: 0x200}
and then looks up key 23.
However, this implementation only allows keys that are (64-bit) integers or characters and our keys will be strings. So we need to replace equal
with some kind of string_equal
in names.keys.index
.
To compare two strings, we first compare their length l
and then the l
following characters for equality.
Assembly instruction: repe cmpsb
compares the first r2
entries of two arrays of bytes: one starting at address rsi
and one starting at address rdi
.
"Sets flags to true" if these r2
entries are equal and "sets flags to false" otherwise (I put quotes because it sets some flags so that je
, jne
, etc will work as expected afterwards).
We implement strings_equals
using repe cmpsb
in header.asm and change names.keys.index
to use it. Try it in forth17.
python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
Its _start
function runs
next-input strings.address next-input strings.address string_equal print
which compares the two first input tokens and prints whether they are equal.
Implementation detail: string_equal
takes two addresses as parameters which goes against our convention of passing indices when possible. If we're sure all strings will come from the strings
array then we could instead pass indices. But this makes some testing and debugging harder if we want to use some hard coded strings (they'd have to be copied to the strings
array first).
With all ingredients needed, we can now add write-loop
to forth.f.
Change names.keys.index
to assume keys are strings (by replacing equal
with strings.address string_equal
).
Try it in forth18.
python compiler.py > forth.asm
nasm -f elf64 forth.asm -o forth.o ;and ld forth.o ;and ./a.out
A sample run:
Enter 3 tokens that will be mapped to 0x100, 0x200 and 0x300:
> foo bar baz
Enter commands to write. The last token must be ']'
> foo foo bar bar baz bar foo baz ]
Writing 00100000
Writing 00100000
Writing 00200000
Writing 00200000
Writing 00300000
Writing 00200000
Writing 00100000
Writing 00300000
Implementation detail: We added an equal_right_bracket
function even though we could have called string_equal
with ']' in its place.
In the next part, we'll add input-loop
, complete the interpreter and then add some missing primitives to get a useful interpreter.
A list of primitives in the final version, arranged by type.
main_loop, input_loop, write_loop
exit_error, exit0, read_stdin
next_char, next_input, next_command, print, print_string
string_equal
add, sub, not, equal
return, if, call_stack.pop_many
s11, s21, s2, s1212
call_stack.is_empty, call_stack.pop, call_stack.push, memory.get, memory.set_at, memory.len, memory.append, strings.len, strings.set_at, strings.append, strings.pop, names.get, names.len, names.keys.get, names.values.get, names.set, strings.get, strings.address
setup
exit_main_loop, primitive_case, names_found, names_not_found, space_loop, word_loop, next_char_is_space, is_input_end
is_primitive, call_primitive, is_whitespace, raw_write_next_input
Intel syntax for assembly instructions mostly follows the
instruction_name param1, param2, ...
pattern. Usually, there are at most two parameters. In Intel syntax the parameters are usually in the same left-to-right order as in Python
mov r1, r2
means r1 = r2
add r1, r2
means r1 += r2
Some instructions allow "computed addresses" as arguments such as [some_array]
or [some_array + 8*r1]
which we can think of as some_array[0]
and some_array[r1]
in Python. 8 is the number of bytes (=8 bits) used by each entry in the array. All our arrays entries are 8 bytes (= 8*8 = 64 bits) except for strings which are 1 byte (= 8 bits) arrays.
However, most instructions will not allow both arguments to be computed addresses so some register has to be used as temporary variable.
The lea
instructions stores the computed address itself (instead of the value at that address in memory). lea r1, [some_array]
stores the address of some_array
in r1
.
A fixed number of variables and re-use means we have to pick some convention or else functions will keep overwriting each other's calls. Saving values that we don't want to lose is done by writing to memory or pushing to the stack.
Memory in assembly really is treated as an array of bits with many convenience functions/syntactic sugar for manipulation 64 consecutive bits at a time. In fact, there are some features for manipulating 8, 16 and 32 consecutive bits at a time as well. The register naming pattern is
al
(8 bits), ax
(16 bits), eax
(32 bits), rax
(64 bits)bl
(8 bits), bx
(16 bits), ebx
(32 bits), rbx
(64 bits)al
and rax
, say, are not separate variables. al
is the last 8 bits of rax
! So normally, only use one or the other at a time.
Furthermore, some convention was adopted about the "b" registers and not the other three "a", "c" and "d" (about who should save its value for calls to avoid being over-written).
This tutorial uses register names r1, r2, ..., r16
where r1=rax, r2=rcx, r3=rdx, r4=rbx, r5=rbp
.
Assigning (mov
ing) things of different bit length (length internally tracked by the programmer) is usually done by padding values with zeros using instructions like movzx
("Move zero extend"). Assigning/copying bits from memory requires indicating the bit length explicitly mov qword r1, [some_array + 8*r2]
.
The bit length indicators are qword
(64 bits), dword
(32 bits), word
(16 bits), byte
(8 bits).
When comparing, some flags are set instead of returning booleans. Then instruction for branching (jne
, je
, jz
, jns
, etc) behave differently depending on which flags are set. This allows making some programs very slightly faster because inc
, dec
and a few other non-comparison operations also set these flags as a side effect. This avoids the need for an extra comparison (at the cost of having to track these side effects in your head).
For jumping (goto) instructions, places in memory can be marked with labels. A local label (as opposed to a global label) starts with a dot (.
) has scope between the last global label and next global label's declaration. So jumps inside a global labels "block" goes to a local label in the same block.
Labels are otherwise unique. There is no further nesting of scope (no "local local labels").
location: Integer index, usually in the memory
array (but possibly elsewhere like the strings
array). Sometimes called pointers.
function name: A string naming a function. This string can contain any character that's not a whitespace. They are call words in Forth.