Shortest introduction to the compiler design

2018/12/21

Authors: Peter Sovietov, Evgeniy Andreev

This article is part of F# Advent calendar in English 2018. You may find original article in Russian with Python examples here.

Introduction

Let’s build a compiler of arithmetic expressions in F#. The compiler should translate source code written in Reverse Polish notation (RPN) to an intermediate representation (IR) suitable for a stack machine. However, we’re not going to build an IR interpreter as you might think. Instead we are going to translate the IR right into C language code!

Syntactic analysis

type IR =
    | Push of int
    | Op of string

let scan (source: string) =
    let tokens = source.Split ' '
    [ for tok in tokens -> 
        match Int32.TryParse tok with
        | true, num -> Push num
        | _ -> Op tok  
    ]

What are we doing here? The scan function takes a line from the user in RPN, e.g. 2 2 +. Basically it is both the lexer and the parser of our compiler. There are various approaches: lexer and parser might be separated (e.g. generated by FsLex and FsYacc) or combined (e.g. implemented with parser combinators library, FParsec). Output of the scan function is the IR, a list of stack machine operations modeled with a discriminated union:

[ Push 2; 
  Push 2; 
  Op "+"
]

OK, this is already a working compiler, just not a serious one. For example, it’s possible to push incorrect operations onto the stack, so we need to introduce another discriminated union (like IR) for the possible operations in our compiler. Let’s leave that as an excercise. Don’t forget that we are going to generate C code as output!

Translation into C code

let trans (ir: IR list) =
    [ for instr in ir do
        match instr with
        | Push value -> 
            yield sprintf "  st[sp] = %d;" value
            yield "  sp += 1;"
        | Op op -> 
            yield sprintf "  st[sp - 2] = st[sp - 2] %s st[sp - 1];" op
            yield "  sp -= 1;" 
    ] |> String.concat "\n"

What’s going on? The trans function uses pattern matching to translate a list of IR instructions into C code. Pattern matching is a very convenient feature for compiler implementation. Let’s see the output for the 2 2 + expression.

  st[sp] = 2;
  sp += 1;
  st[sp] = 2;
  sp += 1;
  st[sp - 2] = st[sp - 2] + st[sp - 1];
  sp -= 1;

The st array is our stack, sp is the pointer to the last element of the stack. What’s left? We need some boilerplate C code for compilation by real C compiler.

Our first compiler

Let’s just put all things together:

open System

let [<Literal>] ST_SIZE = 100
let [<Literal>] C_CODE = """#include <stdio.h>
  int main(int argc, char** argv) {{ 
  int st[{0}], sp = 0;
{1}
  printf("%d\n", st[sp - 1]);
  return 0;
}}"""

type IR =
    | Push of int
    | Op of string

let scan (source: string) =
    let tokens = source.Split ' '
    [ for tok in tokens -> 
        match Int32.TryParse tok with
        | true, num -> Push num
        | _ -> Op tok  
    ]

let trans (ir: IR list) =
    [ for instr in ir do
        match instr with
        | Push value -> 
            yield sprintf "  st[sp] = %d;" value
            yield "  sp += 1;"
        | Op op -> 
            yield sprintf "  st[sp - 2] = st[sp - 2] %s st[sp - 1];" op
            yield "  sp -= 1;" 
    ] |> String.concat "\n"

let rpnToC (source: string) = 
    let code = source |> scan |> trans
    String.Format(C_CODE, ST_SIZE, code)

printfn "%s" (rpnToC "2 2 + 3 -")

The rpnToC function is our compilation pipeline. Literally! What’s the output?

#include <stdio.h>
int main(int argc, char** argv) {
  int st[100], sp = 0;
  st[sp] = 2;
  sp += 1;
  st[sp] = 2;
  sp += 1;
  st[sp - 2] = st[sp - 2] + st[sp - 1];
  sp -= 1;
  st[sp] = 3;
  sp += 1;
  st[sp - 2] = st[sp - 2] - st[sp - 1];
  sp -= 1;
  printf("%d\n", st[sp - 1]);
  return 0;
}

Let’s talk a little bit about the result. It’s weird that our compiler can translate only constant expressions. Clearly, we can compute it right in the compile-time, there’s no need to translate it at all. But consider that some arguments can come into stack from outside, e.g. from command line arguments. OK, let’s stop here. We can add command line processing to our implementation later. For now it’s more important to understand the compiler design at high-level.

SSA form

Static single assignment form sounds important for every compiler developer. What it is?

For now the compiler generates plain C code and doesn’t need a virtual machine. But why do we need to generate code working with the stack in the runtime? Let’s replace the stack array with plain variables. For every expression we will introduce a new one, and every variable should be assigned just once. Just no need to skimp on names. :)

And… this is SSA form! Our new compiler:

open System
open System.Text

let [<Literal>] C_CODE = """#include <stdio.h>
int main(int argc, char** argv) {{
{0}
  printf("%d\n", {1});
  return 0;
}}"""

type IR =
    | Push of int
    | Op of string

type Env = {
    stack: string list
    nameCounter: int
}

let emptyEnv = { stack = []; nameCounter = 0 }

let scan (source: string) =
    let tokens = source.Split ' '
    [ for tok in tokens -> 
        match Int32.TryParse tok with
        | true, num -> Push num
        | _ -> Op tok  
    ]

let trans (ir: IR list) =
    let transInstr (env: Env, code: StringBuilder) = function
    | Push value -> 
        let statement = sprintf "  int t%d = %d;" env.nameCounter value
        let code = code.AppendLine statement
        let stack = (sprintf "t%d" env.nameCounter) :: env.stack
        { env with stack = stack; nameCounter = env.nameCounter + 1}, code
    | Op op -> 
        let (leftOperand :: rightOperand :: stack) = env.stack
        let statement = 
            sprintf "  int t%d = %s %s %s;" 
                env.nameCounter 
				rightOperand 
				op 
				leftOperand
        let code = code.AppendLine statement
        let stack = (sprintf "t%d" env.nameCounter) :: stack
        { env with stack = stack; nameCounter = env.nameCounter + 1}, code

let rpnToC (source: string) = 
    let env, code = source |> scan |> trans
    String.Format(C_CODE, code, env.stack.Head)

printfn "%s" (rpnToC "2 2 + 3 -") 

Please keep in mind that there is no stack in the produced C code. The stack processing is performed in the compile-time. This is our internal stack for variables’ names (not values as in the previous version):

type Env = {
    stack: string list
    nameCounter: int
}

let emptyEnv = { stack = []; nameCounter = 0 }

The processing of the stack can be modeled through folding. This is a standard way in FP to perform stateful operations:

let trans (ir: IR list) =
    let transInstr (env: Env, code: StringBuilder) = function
    | Push value -> (* ... *)
    | Op op -> (* ... *)

    ir |> List.fold transInstr (emptyEnv, StringBuilder())

There is also some non-safe code:

let (leftOperand :: rightOperand :: stack) = env.stack

It fails when the stack does not contain enough elements. Clear error messages are another excercise for the reader. Compilation errors are a very important part of the compiler!

Final output:

#include <stdio.h>
int main(int argc, char** argv) {
  int t0 = 2;
  int t1 = 2;
  int t2 = t0 + t1;
  int t3 = 3;
  int t4 = t2 - t3;

  printf("%d\n", t4);
  return 0;
}

It seems like a good start for a more complicated compiler, right? We can go with stack languages like Forth, Postscript, Joy or Factor. Even statically typed functional language can be stack-based! Obviously another way is to increase syntax complexity. But let’s leave all these questions for later posts.

We wish you success in compiler design. Merry Christmas and Happy New Year!