Authors: Peter Sovietov, Evgeniy Andreev
This article is part of F# Advent calendar in English 2018. You may find original article in Russian with Python examples here.
Introduction
Let’s build a compiler of arithmetic expressions in F#. The compiler should translate source code written in Reverse Polish notation (RPN) to an intermediate representation (IR) suitable for a stack machine. However, we’re not going to build an IR interpreter as you might think. Instead we are going to translate the IR right into C language code!
Syntactic analysis
type IR =
| Push of int
| Op of string
let scan (source: string) =
let tokens = source.Split ' '
[ for tok in tokens ->
match Int32.TryParse tok with
| true, num -> Push num
| _ -> Op tok
]
What are we doing here? The scan
function takes a line from the user in RPN, e.g. 2 2 +
. Basically it is both the lexer and the parser of our compiler. There are various approaches: lexer and parser might be separated (e.g. generated by FsLex and FsYacc) or combined (e.g. implemented with parser combinators library, FParsec). Output of the scan
function is the IR, a list of stack machine operations modeled with a discriminated union:
[ Push 2;
Push 2;
Op "+"
]
OK, this is already a working compiler, just not a serious one. For example, it’s possible to push incorrect operations onto the stack, so we need to introduce another discriminated union (like IR
) for the possible operations in our compiler. Let’s leave that as an excercise. Don’t forget that we are going to generate C code as output!
Translation into C code
let trans (ir: IR list) =
[ for instr in ir do
match instr with
| Push value ->
yield sprintf " st[sp] = %d;" value
yield " sp += 1;"
| Op op ->
yield sprintf " st[sp - 2] = st[sp - 2] %s st[sp - 1];" op
yield " sp -= 1;"
] |> String.concat "\n"
What’s going on? The trans
function uses pattern matching to translate a list of IR instructions into C code. Pattern matching is a very convenient feature for compiler implementation. Let’s see the output for the 2 2 +
expression.
st[sp] = 2;
sp += 1;
st[sp] = 2;
sp += 1;
st[sp - 2] = st[sp - 2] + st[sp - 1];
sp -= 1;
The st
array is our stack, sp
is the pointer to the last element of the stack. What’s left? We need some boilerplate C code for compilation by real C compiler.
Our first compiler
Let’s just put all things together:
open System
let [<Literal>] ST_SIZE = 100
let [<Literal>] C_CODE = """#include <stdio.h>
int main(int argc, char** argv) {{
int st[{0}], sp = 0;
{1}
printf("%d\n", st[sp - 1]);
return 0;
}}"""
type IR =
| Push of int
| Op of string
let scan (source: string) =
let tokens = source.Split ' '
[ for tok in tokens ->
match Int32.TryParse tok with
| true, num -> Push num
| _ -> Op tok
]
let trans (ir: IR list) =
[ for instr in ir do
match instr with
| Push value ->
yield sprintf " st[sp] = %d;" value
yield " sp += 1;"
| Op op ->
yield sprintf " st[sp - 2] = st[sp - 2] %s st[sp - 1];" op
yield " sp -= 1;"
] |> String.concat "\n"
let rpnToC (source: string) =
let code = source |> scan |> trans
String.Format(C_CODE, ST_SIZE, code)
printfn "%s" (rpnToC "2 2 + 3 -")
The rpnToC
function is our compilation pipeline. Literally! What’s the output?
#include <stdio.h>
int main(int argc, char** argv) {
int st[100], sp = 0;
st[sp] = 2;
sp += 1;
st[sp] = 2;
sp += 1;
st[sp - 2] = st[sp - 2] + st[sp - 1];
sp -= 1;
st[sp] = 3;
sp += 1;
st[sp - 2] = st[sp - 2] - st[sp - 1];
sp -= 1;
printf("%d\n", st[sp - 1]);
return 0;
}
Let’s talk a little bit about the result. It’s weird that our compiler can translate only constant expressions. Clearly, we can compute it right in the compile-time, there’s no need to translate it at all. But consider that some arguments can come into stack from outside, e.g. from command line arguments. OK, let’s stop here. We can add command line processing to our implementation later. For now it’s more important to understand the compiler design at high-level.
SSA form
Static single assignment form sounds important for every compiler developer. What it is?
For now the compiler generates plain C code and doesn’t need a virtual machine. But why do we need to generate code working with the stack in the runtime? Let’s replace the stack array with plain variables. For every expression we will introduce a new one, and every variable should be assigned just once. Just no need to skimp on names. :)
And… this is SSA form! Our new compiler:
open System
open System.Text
let [<Literal>] C_CODE = """#include <stdio.h>
int main(int argc, char** argv) {{
{0}
printf("%d\n", {1});
return 0;
}}"""
type IR =
| Push of int
| Op of string
type Env = {
stack: string list
nameCounter: int
}
let emptyEnv = { stack = []; nameCounter = 0 }
let scan (source: string) =
let tokens = source.Split ' '
[ for tok in tokens ->
match Int32.TryParse tok with
| true, num -> Push num
| _ -> Op tok
]
let trans (ir: IR list) =
let transInstr (env: Env, code: StringBuilder) = function
| Push value ->
let statement = sprintf " int t%d = %d;" env.nameCounter value
let code = code.AppendLine statement
let stack = (sprintf "t%d" env.nameCounter) :: env.stack
{ env with stack = stack; nameCounter = env.nameCounter + 1}, code
| Op op ->
let (leftOperand :: rightOperand :: stack) = env.stack
let statement =
sprintf " int t%d = %s %s %s;"
env.nameCounter
rightOperand
op
leftOperand
let code = code.AppendLine statement
let stack = (sprintf "t%d" env.nameCounter) :: stack
{ env with stack = stack; nameCounter = env.nameCounter + 1}, code
let rpnToC (source: string) =
let env, code = source |> scan |> trans
String.Format(C_CODE, code, env.stack.Head)
printfn "%s" (rpnToC "2 2 + 3 -")
Please keep in mind that there is no stack in the produced C code. The stack processing is performed in the compile-time. This is our internal stack for variables’ names (not values as in the previous version):
type Env = {
stack: string list
nameCounter: int
}
let emptyEnv = { stack = []; nameCounter = 0 }
The processing of the stack can be modeled through folding. This is a standard way in FP to perform stateful operations:
let trans (ir: IR list) =
let transInstr (env: Env, code: StringBuilder) = function
| Push value -> (* ... *)
| Op op -> (* ... *)
ir |> List.fold transInstr (emptyEnv, StringBuilder())
There is also some non-safe code:
let (leftOperand :: rightOperand :: stack) = env.stack
It fails when the stack does not contain enough elements. Clear error messages are another excercise for the reader. Compilation errors are a very important part of the compiler!
Final output:
#include <stdio.h>
int main(int argc, char** argv) {
int t0 = 2;
int t1 = 2;
int t2 = t0 + t1;
int t3 = 3;
int t4 = t2 - t3;
printf("%d\n", t4);
return 0;
}
It seems like a good start for a more complicated compiler, right? We can go with stack languages like Forth, Postscript, Joy or Factor. Even statically typed functional language can be stack-based! Obviously another way is to increase syntax complexity. But let’s leave all these questions for later posts.
We wish you success in compiler design. Merry Christmas and Happy New Year!