A naive C compiler
For several years, on and off, I’ve been writing a C compiler toolchain. It’s completely standalone, and includes its own SSA form intermediate representation, its own libc, and even its own linker. It’s written in C, and is complete enough to self-host. The point of this project was to learn about compilers, so I thought I’d write about it; design decisions, what went well, what didn’t, what things I’d change. I’m far from an expert on compilers and it’s a very deep field, so take all of the following with a grain of salt (and please send me corrections on anything I’m wrong about).
Background
After playing around with interpreters for various esoteric languages and writing a self-hosting LLVM-based compiler for a gradually-typed Lisp dialect for my university thesis, I wanted to write a compiler for a “real” language.
C seemed a good choice. It compiles down very naturally to straightforward code with minimal runtime requirements. It’s definitely “real”, with a specification, large body of existing code, and several high-quality reference compilers. I then chose to write the compiler itself in C because self-hosting is such a fun concept, and also because it defines a natural victory condition.
Somewhere along the way I decided to write the full toolchain, not just a compiler. It just felt disappointing to write a C compiler that shells out to another C compiler, even if it is just for assembling and linking. Besides, writing an assembler is pretty easy compared to the rest of the compiler. Even writing a basic linker is not that bad.
It’s called “naive” because this reflects my approach. I’ve always learned best by doing, so I didn’t want to spend ages reading tons of theory before starting. I’d rather just dive in and do things in the obvious, naive way, and focus any research on solving concrete problems after gaining context from running headfirst into them. So “naive” is both a name and a description.
I started by writing a skeleton of each phase of the compiler, writing the bare minimum necessary to compile the program int main(void) { return 0; } into an executable. Then I just added features one-by-one, extending each phase as necessary to support that feature. This was a good approach and got me past the initial overwhelming set of things to implement and into solving specific problems. I started working on it in earnest in March of 2016. 17 months, 707 commits and 15.7k lines of code later, it was self-hosting. I got interested again in 2024 and returned to improve it and add some more features, primarily floating point support. I’ve gotten a bit tired of working on it now for reasons I’ll discuss later, so this seemed like a good point to stop and reflect.
Infrastructure
Testing
The testing strategy is a little unusual in that there are zero unit tests, but a lot of integration tests (221). This is for a few reasons, the most obvious being that unit testing in C is a pain (particularly since I didn’t want to use any external libraries). But a better reason is that a C compiler has a very stable and well-specified interface to test against, which makes integration tests pleasant to write and easy to maintain. The downside is that it’s harder to identify where bugs are than with more tightly focused unit tests, but the natural layering of the compiler usually makes it fairly easy to narrow things down.
The integration tests are what you might call “golden” tests – there’s a set of input C files along with corresponding expected outputs, and the testing system compiles each file and runs the resulting binary (as long as it’s expected to successfully compile), checking return codes and output streams for each of these steps.
This is such a natural form of compiler testing that I expect there are as many implementations as there are compilers. Mine is a Python script which scans through the files in the tests directory and runs the tests. It also supports a few other features like saving the current output as expected output. So writing a new test is as simple as writing the input file, dropping it in the directory, and running the script with a special argument to generate the output data. It interacts with the compiler only via its command-line interface, so any internal refactoring has no effect on the tests, which can remain unchanged.
This approach is also nice in that it encourages you to invest in your debug tooling. Since you can very conveniently add new tests which use the command line interface, you benefit from making that command line interface as rich as possible. If you have flags which enable extra debug output, you can make more narrowly scoped tests by testing the output with these flags enabled. You can take it even further and make new binaries (or modes for the main binary) which use the debug output at some intermediate stage as input, rather than working from C source code. All of this comes in handy when debugging, as you have lots of entry points, nice debug output, and lots of command line control which you can use when running by hand as well as from the test framework.
Overall this approach worked very well, and I’d do the same thing again. It’s worth investing time into making a nice testing system like this up-front, so that adding tests is as painless as possible. Next time, I’d invest even more time into it up-front, including features like nice diffing for test failures and the ability to organise tests better. I’d also commit more to making reusable debug output. While I did this pretty well, there were some cases where I left in commented out code for debug purposes, or added stuff on an ad hoc basis. Having nice flag infrastructure where you can easily add flags in modules without having to plumb them through all the way from main would be a great help.
Data structures
C does not have generics or any built-in data structures besides fixed-length arrays. It wasn’t feasible to do everything with bare arrays, but without generics it’s not possible to write nice general purpose data structures that can contain any type.
My compromise was to make one kinda janky generic data structure and use it for anything I could. It’s a simple dynamically growing array called Array. The raw data type looks like:
typedef struct Array_
{
    u8 *elements;
    u32 size;
    u32 capacity;
} Array_;
I use it via a macro, #define Array(T) Array_, which just drops the element type, but allows you to specify it directly as part of the type declaration for documentation purposes.
Array knows nothing about the type of the elements it contains; it just has a raw byte array and a size and capacity in bytes. Almost all operations on Arrays need to know the element size, so you have to explicitly pass it in. Rather than passing a size directly, I made macros which either do the operation in-line or pass the sizeof the type to a generic implementation. For instance, take ARRAY_REF, which returns a pointer to the element at a given index:
#define ARRAY_REF(array, element_type, i) \
    ((element_type *)&(array)->elements[(i) * sizeof(element_type)])
Or ARRAY_REMOVE, which removes the element at a given index, shifting the following elements down:
#define ARRAY_REMOVE(array, element_type, removal_point) \
    _array_remove((array), sizeof(element_type), (removal_point))
This does require you to correctly specify the type name every time you use these macros, and getting it wrong is a source of bugs. It’s not ideal, but it’s a decent enough compromise given the tools available.
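For illustration, here’s a minimal sketch of what the generic implementation behind ARRAY_REMOVE might look like – not the literal code, and it assumes size is tracked in bytes as described above (memmove is from string.h):

static void _array_remove(Array_ *array, u32 element_size, u32 removal_point)
{
    // Shift everything after the removed element down by one slot.
    u8 *dest = array->elements + removal_point * element_size;
    u8 *src = dest + element_size;
    u32 bytes_to_move = array->size - (removal_point + 1) * element_size;
    memmove(dest, src, bytes_to_move);
    array->size -= element_size;
}

Because it only needs the element size, not the type, one copy of this code serves every element type.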
The other data structure I used quite a bit is Pool. This is a simple arena used to batch together memory allocations and free them all at once. It helps avoid fragmentation when making lots of small allocations, but mainly I used it to simplify memory management. In many situations I have complex trees of data containing lots of individual objects, all with the same lifetime, and it’s a lot simpler to put them all in the same pool and free the pool in one go than to traverse the data structure and free everything individually (making sure to free all child objects before their parents, avoiding double-freeing things which have multiple references, and so on). This is pretty common because of the phased nature of compilation. Generally each phase contains lots of cross-references between everything used in that phase, so you have to keep it all alive until the next phase. But after the phase is done, you know you don’t need any of it anymore, so you can throw it all away at once.
It looks like this:
typedef struct PoolBlock
{
    size_t used;
    u8 *memory;
    struct PoolBlock *next;
} PoolBlock;

typedef struct Pool
{
    size_t block_size;
    PoolBlock *first_block;
    PoolBlock *first_block_with_space;
} Pool;
Allocations go into the first block with space remaining. If the latest block can’t contain the allocation, we malloc a new one and set the pointers accordingly. To free the pool we traverse the links from first_block, freeing blocks one by one. In theory this can result in lots of wasted memory for certain patterns of allocations (e.g. alternating 1-byte and block_size-byte allocations would use ~twice the necessary memory), but in practice block_size is large and allocations are small, so keeping allocation fast by not searching earlier blocks for space is the better trade-off.
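Here’s a sketch of what allocation might look like under this scheme. The details are guessed (in particular I’ve co-allocated the block header with its memory, ignored alignment, and assumed no allocation exceeds block_size; malloc is from stdlib.h), but the shape is as described:

void *pool_alloc(Pool *pool, size_t size)
{
    PoolBlock *block = pool->first_block_with_space;
    if (block == NULL || block->used + size > pool->block_size) {
        // The current block can't fit this allocation: malloc a fresh
        // block (header and memory together, for simplicity) and link it in.
        PoolBlock *new_block = malloc(sizeof *new_block + pool->block_size);
        new_block->used = 0;
        new_block->memory = (u8 *)(new_block + 1);
        new_block->next = NULL;
        if (block != NULL)
            block->next = new_block;
        else
            pool->first_block = new_block;
        pool->first_block_with_space = new_block;
        block = new_block;
    }
    void *result = block->memory + block->used;
    block->used += size;
    return result;
}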
I also added a special case for the clang sanitizers. When running under a sanitizer we replace the pool with an Array of pointers and allocate each object individually. This allows the sanitizers to precisely track allocations and makes it much easier to understand the bugs they find. When a report points to an allocation, we get the exact pool_alloc call that allocated the object in question, rather than the last call that happened to allocate the block the object is in.
Overall I’m happy with how Pools were used, but I think I did overuse them slightly. They’re ideal when the lifetime of everything contained really is the same, and you can’t free anything in a pool until you free all of it. When this isn’t the case, you can waste memory by keeping around objects which you know you don’t need anymore. Their convenience makes them tempting to use even in such cases, and there were a few where another approach probably would have been better.
Phases
One of the nicest things about a compiler as a software project is how clear the layering is. Transforming C source code into a binary is a series of phases, each of which takes the output from the previous step and transforms it into a form slightly further from textual source code and slightly closer to a binary. This makes for a very nice separation of concerns. For naive, I’d break these steps down as follows:
C source code -> preprocessed source code -> tokens -> AST -> IR -> assembly (virtual registers) -> assembly (physical registers) -> object file -> executable
Most of these have nice textual debug output, so let’s run a simple source file through them as we talk about each stage. I’ve trimmed some of the output down as it can get quite verbose.
Here’s the initial C source code we’re working with:
#define INT_MIN (-2147483648)
int max(int *numbers, int size) {
int m = INT_MIN;
for (int i = 0; i < size; i++) {
if (numbers[i] > m) {
m = numbers[i];
}
}
return m;
}
Frontend
Preprocessor
$ ncc -E max.c
Preprocessed source code:
int max(int *numbers, int size) {
int m = (-2147483648);
for (int i = 0; i < size; i++) {
if (numbers[i] > m) {
m = numbers[i];
}
}
return m;
}
The preprocessor was initially integrated into the lexer, with directives essentially just injecting text into the lexer or modifying its state. This worked fine for a while and even seemed quite elegant. But as I added preprocessor features it grew untenable, and I separated it out into its own phase, which takes in text and outputs more text. We can’t just output text though, as otherwise our error messages would refer to locations in the preprocessed text that don’t line up with the original input. I didn’t want to attach source locations to every single character, so the preprocessor produces a side output of “adjustments”. These each have an index into the preprocessed text and adjust the lexer’s current source location to deal with offsets created by preprocessing.
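Concretely, an adjustment might be as simple as this (a sketch of the idea rather than the exact representation used):

typedef struct Adjustment
{
    u32 location;   // byte offset into the preprocessed text
    char *filename; // the original file the following text came from
    u32 line;       // the line in that file to resume counting from
} Adjustment;

When the lexer’s position passes location, it switches its reported source location accordingly.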
This worked okay, but running the preprocessor directly from raw text is a real pain. You’re doing manual character-level scanning all over the place, and having to handle whitespace/comments and lex symbols and strings in the middle of handling preprocessor directives. Because of this there are a few corner cases in macro expansion (in particular involving argument prescan) which would be quite a pain to deal with, so I’ve simply avoided them. An extra “preprocessor lexing” phase would have been better.
The C standard has a concept of “preprocessing tokens”, which the source is lexed into before being fed into the preprocessor. It’s sometimes tricky to decide whether or not to implement something as written in the standard, since the standard is often written for ease of specification and understanding, and isn’t always practical or necessary to implement directly. Preprocessing tokens felt unnecessary to me. But ultimately I think it would be better to first lex into preprocessing tokens, as it would make a correct preprocessor implementation a lot easier and less fiddly. It would make for a better separation of concerns, as this stage of lexing would be the only part that has to care about character-level manipulation of the source text.
Preprocessing tokens are a bit awkward though. They include a lot of normal things like numbers, strings, and identifiers, but the final type in the list is “each non-white-space character that cannot be one of the above”. This is necessary as lexing into preprocessing tokens happens before executing any directives. A valid C program can contain #if 0 ... #endif blocks, within which just about anything goes, including for instance unterminated strings. This put me off, but I believe it would be the lesser evil to deal with. For performance we could optionally include some special case for conditionals where we quickly scan for the next #else, #endif, etc. and defer breaking the contents up into preprocessing tokens until we know that the block is actually included.
Another aspect that drove my decision was the fact that you can paste tokens together using the ## preprocessor operator or “stringize” them with #, and that felt much more straightforward to do if you just kept it as text. But again, this is manageable. Since preprocessing tokens are always taken verbatim from the source file, they can simply be a type, a char * and a size. This makes token pasting easy and stringizing trivial.
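Since they’re just verbatim slices of the source, preprocessing tokens can be represented very cheaply. A sketch (the enum is abbreviated, and these names are mine, not the standard’s):

typedef enum PPTokenType
{
    PP_TOK_IDENTIFIER,
    PP_TOK_NUMBER,
    PP_TOK_STRING_LITERAL,
    // ...
    // "each non-white-space character that cannot be one of the above"
    PP_TOK_OTHER,
} PPTokenType;

typedef struct PPToken
{
    PPTokenType type;
    char *chars; // points directly into the source text, not a copy
    u32 length;
} PPToken;

Pasting is then concatenating two character ranges and re-lexing the result, and stringizing just wraps a range in quotes.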
Overall, dealing with these annoyances would have been nicer than running the preprocessor directly from text, and this is something I’d do differently if I did it again.
Lexer
$ ncc -dump-tokens -c max.c
1:2, TOK_SYMBOL(int)
3:5, TOK_SYMBOL(max)
3:8, TOK_LROUND
3:9, TOK_SYMBOL(int)
3:13, TOK_ASTERISK
<snip>
11:2, TOK_SYMBOL(return)
11:9, TOK_SYMBOL(m)
11:10, TOK_SEMICOLON
12:1, TOK_RCURLY
The lexer is very simple, and there’s not much to talk about. It’s the exact loop over the text + switch on the next character that you would imagine. C’s lexical grammar is pretty simple, and while you can use a lexer generator it’s really not necessary and just adds complexity. Numbers are a little fiddly, having to handle 3 different integer radixes and a couple of floating point formats, but everything else is quite easy.
One choice I made was to merge identifiers and keywords into one token type (“symbol”) which contains a string. The idea was to make it easier to produce nice error messages when you use a keyword in place of an identifier, as you can produce a complete parse tree and reject invalid identifiers later, rather than failing to produce a parse tree entirely. It does mean however that we do a lot more full string comparisons in the parser to check for keywords. If they were separate token types then depending on the structure of the parser you could still add a special case that checks if a keyword was used when an identifier was expected and fails forward, producing a parse tree with a dummy identifier in place of the incorrectly used keyword. An alternative could be to keep the “symbol” token type, but to intern every symbol to make all such comparisons constant time (with keywords pre-interned up-front).
We return tokens augmented with source location information, for error messages. A key thing to note about source locations is that almost all of them will be thrown away and never used. You only use them when showing error messages, and for implementing __LINE__ and __FILE__. It may be better to not compute the line and column number up-front, and instead to just store a byte offset. Making it as compact as possible would be beneficial for the common case where it’s unused, which could be worth slowing down the uncommon case for.
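A sketch of that representation: each token carries a single 32-bit offset, and line/column are recomputed only when an error message actually needs them.

typedef u32 SourceLoc; // just a byte offset into the original source

// Only called when formatting an error message, so it can afford to
// re-scan the file from the beginning.
static void line_and_column(char *source, SourceLoc loc, u32 *line, u32 *col)
{
    *line = 1;
    *col = 1;
    for (SourceLoc i = 0; i < loc; i++) {
        if (source[i] == '\n') {
            (*line)++;
            *col = 1;
        } else {
            (*col)++;
        }
    }
}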
Our source location information ended up being quite insufficient in many cases anyway. Because of the preprocessor, you need to store a lot more context in order to adequately identify where a token came from. The error could be three #include’d files deep, within the expansion of a macro where some of the text was directly present at the macro invocation, some was in the macro definition, and some was pasted together from a combination of both, all of which could be in different files. Really you need some kind of rich context tree, not just a filename, line and column.
Parser
$ ncc -dump-ast -c max.c
FUNCTION_DEF(
DECL_SPECIFIER(
NAMED_TYPE_SPECIFIER(
int
)
),
DIRECT_DECLARATOR(
<snip>
),
OLD_STYLE_PARAM_DECL_LIST(
),
COMPOUND_STATEMENT(
BLOCK_ITEM_DECL(
DECL(
DECL_SPECIFIER(
NAMED_TYPE_SPECIFIER(
int
)
),
INIT_DECLARATOR(
DIRECT_DECLARATOR(
IDENTIFIER_DECLARATOR(
m
)
),
EXPR_INITIALIZER(
UNARY_MINUS_EXPR(
INT_LITERAL_EXPR(
2147483648
)
)
)
)
)
),
<snip>
BLOCK_ITEM_STATEMENT(
RETURN_STATEMENT(
IDENTIFIER_EXPR(
m
)
)
)
)
)
The parser is something I would rewrite completely. The current implementation uses a custom Parsing Expression Grammar DSL, with a Python program that generates C code based on the grammar specification. This works fine and is an elegant way of thinking about the grammar but has a few issues. Handling precedence is annoying, requiring a separate parser function for every precedence level, with the parser inefficiently walking down a level for every expression type. Parsing lists requires a fold operation and making linked lists of nodes which are inefficient to traverse later. Eliminating left recursion results in some awkward grammar rules.
The parser distinguishes between alternative options at each point by simply trying each in sequence, and backing up when it can’t match against the input tokens, even if the two options contain a common prefix. This results in a lot of repeated work, and a lot of allocated AST nodes which we have to just throw away. This can be improved by turning it into a packrat parser and memoising results at each point, but this increases memory usage heavily.
Ultimately this could all be replaced with a simple hand-written recursive descent parser, using Pratt parsing for precedence. This would solve the problems above while also allowing for much better diagnostics and recovery, as you can hand-write the diagnostics for particular failed matches using your intuition for what the user likely intended. By carefully studying where grammar rules have common prefixes and storing those prefixes until you find the disambiguating token you can avoid most (all?) lookahead.
For instance, at the top level of a C source file the two valid syntactic options are a declaration and a function definition. Looking at the specification these look quite distinct, but once you get through the layers of grammar rules you’ll find that they both start with (simplifying slightly) a type. My current parser would try each option in sequence, backing up to the start once it fails to parse them as the attempted grammar rule. Instead, once you’ve identified the common prefix, you can just parse that prefix unconditionally until you get to the point which would distinguish the two. At this point you can distinguish them without any lookahead, and simply pass the already-parsed state into whichever of the two it turned out to be.
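For reference, the core of a Pratt-style expression parser is very small. This is a generic sketch rather than code from naive, and the Parser, ASTExpr and token helpers are assumed:

// Binding power for each binary operator; higher binds tighter.
static int precedence(TokenType op)
{
    switch (op) {
    case TOK_PLUS: case TOK_MINUS: return 10;
    case TOK_ASTERISK: case TOK_SLASH: return 20;
    default: return -1; // not a binary operator
    }
}

// Parse an expression, consuming operators that bind tighter than min_prec.
static ASTExpr *parse_expr(Parser *parser, int min_prec)
{
    // parse_primary handles literals, identifiers, parens, unary ops, ...
    ASTExpr *left = parse_primary(parser);
    for (;;) {
        TokenType op = peek_token(parser);
        int prec = precedence(op);
        if (prec <= min_prec)
            return left;
        next_token(parser);
        // Recursing with prec gives left associativity; prec - 1 here
        // would give right associativity (e.g. for assignment).
        ASTExpr *right = parse_expr(parser, prec);
        left = make_binary_expr(op, left, right);
    }
}

The top-level call is just parse_expr(parser, 0), and this one function replaces the whole tower of per-precedence-level grammar rules.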
I think parsing gets too much attention. Especially all the “academic” parsing techniques which are rarely used in practice. I can somewhat understand why, as there’s all sorts of interesting mathematical theory behind different classes of language and formal grammars and parser generators, etc. But in practice, complicated formal languages are slow to parse and hard for humans to understand, and generated parsers have poor ergonomics.
Programming language syntax is designed, it doesn’t just arise out of the ether and demand we parse it in whatever form it happens to take. So with some notable exceptions, programming language syntax is designed to be straightforward to parse and read, and has tended more in this direction over time. Rust, for instance, tends to place a unique keyword immediately up-front in situations with many different syntactic options. I don’t know for sure that this was done to simplify parsing, but I strongly suspect it.
Compared to later stages of a compiler, going from raw text to a structured syntax tree is pretty straightforward, and there’s really just one valid result. Other topics like optimisation are much more interesting and open-ended. A parser can be “done”, but an optimiser never is.
IR generation
This is mostly a pretty straightforward mapping of source operations to IR operations. Unlike some languages, C doesn’t have complicated semantics implicit in the source. Things are mostly spelled out very explicitly, which makes this phase simple. There are a few code shape choices I’d change (like using a “double test” form for loops instead of the current “single test”) but nothing major.
Overall it follows the philosophy of generating straightforward IR that’s guaranteed to be correct in all situations, and not trying to be clever to generate better IR. Syntactic elements are translated using only local knowledge as much as possible, not paying any attention to their surrounding context. This can result in very verbose and redundant IR, but the idea is that this is left to later stages to clean up in order to make IR generation as simple as possible.
Middle-end
IR
$ ncc -dump-ir -c max.c
max (*, i32) -> i32 = {
0.entry:
#0 * = local(*)
store(#0, @0)
#2 * = local(i32)
store(#2, @1)
#4 * = local(i32)
store(#4, -2147483648)
#6 * = local(i32)
store(#6, 0)
jump(1.for.ph)
1.for.ph:
#9 i32 = load(i32, #6)
#10 i32 = load(i32, #2)
#11 i32 = cmp(slt, #9, #10)
cond(#11, 2.for.body, 6.for.after)
2.for.body:
#13 * = load(*, #0)
#14 i32 = load(i32, #6)
#15 i64 = zext(#14, i64)
#16 i64 = cast(#13, i64)
#17 i64 = mul(#15, 4)
#18 i64 = add(#16, #17)
#19 * = cast(#18, *)
#20 i32 = load(i32, #19)
#21 i32 = load(i32, #4)
#22 i32 = cmp(sgt, #20, #21)
cond(#22, 3.if.then, 4.if.after)
3.if.then:
#23 * = load(*, #0)
#24 i32 = load(i32, #6)
#25 i64 = zext(#24, i64)
#26 i64 = cast(#23, i64)
#27 i64 = mul(#25, 4)
#28 i64 = add(#26, #27)
#29 * = cast(#28, *)
#30 i32 = load(i32, #29)
store(#4, #30)
jump(4.if.after)
4.if.after:
jump(5.for.update)
5.for.update:
#35 i32 = load(i32, #6)
#36 i32 = add(#35, 1)
store(#6, #36)
jump(1.for.ph)
6.for.after:
#39 i32 = load(i32, #4)
ret(#39)
}
The intermediate representation (IR) is a hybrid linear/graph IR, with an explicit control flow graph of basic blocks containing linear sequences of static single assignment (SSA) form instructions. It’s very similar to the LLVM IR, which is no coincidence as I was reasonably familiar with it already. LLVM is very battle-tested, and following its lead has served well.
The only change of note that I’d make is changing from phi functions to block arguments. Phi functions are used in SSA to make values which depend on control flow. They take a set of (block, value) pairs, and evaluate to a different value depending on the previously executed block. Block arguments are equivalent but a little easier to reason about.
Using block arguments could also simplify a few other things: we have a special kind of value for function arguments (you can see them with the @ sigil above). With block arguments, we could just make the entry block take arguments and unify the two. Similarly we could have a special “return” block which contains no instructions, and replace the ret instruction with a branch to this block (which would also make the control flow graph cleaner, by giving us a terminal block where control flow always terminates).
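To sketch the difference in a hypothetical extension of the dump syntax above, a counter flowing around a loop needs a phi like this:

1.loop:
    #5 i32 = phi((0.entry, 0), (1.loop, #8))
    #8 i32 = add(#5, 1)
    jump(1.loop)

whereas with block arguments the block declares the value it receives, and each jump passes it explicitly:

0.entry:
    jump(1.loop, 0)
1.loop(#5 i32):
    #8 i32 = add(#5, 1)
    jump(1.loop, #8)

The information content is identical, but the block-argument form reads like an ordinary call rather than an instruction that peeks backwards at control flow.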
IR transformations
The main advantage of such an intermediate representation is ease of analysis and transformation. My focus on getting to self-hosting means I never got around to actually doing any of this, as it isn’t necessary for simply producing a working binary out the end.
The only exception is one very basic optimisation: when creating an instruction, if it has two constant operands we constant-fold the result immediately. This actually simplifies some things in the backend, as x86 doesn’t have a two-immediate form for many operations. We’d either need to constant fold or spill immediates, and constant folding up-front is very simple.
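In sketch form (the helper names are illustrative, not naive’s actual API):

// Called wherever IR generation would emit an add.
IrValue build_add(IrBuilder *builder, IrValue left, IrValue right)
{
    if (left.kind == VALUE_CONST && right.kind == VALUE_CONST) {
        // Both operands are known already, so fold now instead of
        // emitting an instruction the backend would have to handle.
        return const_value(left.constant + right.constant);
    }
    return build_instr(builder, OP_ADD, left, right);
}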
I’m very interested in optimisation so this is something I want to do a lot more of in the future.
Backend
Instruction selection
$ ncc -dump-live-ranges -c /tmp/max.c
max:
entry:
0 mov #0(64), rdi
1 mov #1(32), esi
2 mov [rsp], #0(64)
3 mov [rsp + 8], #1(32)
4 mov [rsp + 12], dword -2147483648
5 mov [rsp + 16], dword 0
6 jmp for.ph
for.ph:
7 mov #2(32), [rsp + 16]
8 mov #3(32), [rsp + 8]
9 cmp #2(32), #3(32)
10 jl for.body
11 jmp for.after
for.body:
12 mov #4(64), [rsp]
13 mov #5(32), [rsp + 16]
14 mov #6(32), #5(32)
15 imul #7(64), #6(64), 4
16 mov #8(64), #4(64)
17 add #8(64), #7(64)
18 mov #9(32), [#8(64)]
19 mov #10(32), [rsp + 12]
20 cmp #9(32), #10(32)
21 jg if.then
22 jmp if.after
if.then:
23 mov #11(64), [rsp]
24 mov #12(32), [rsp + 16]
25 mov #13(32), #12(32)
26 imul #14(64), #13(64), 4
27 mov #15(64), #11(64)
28 add #15(64), #14(64)
29 mov #16(32), [#15(64)]
30 mov [rsp + 12], #16(32)
31 jmp if.after
if.after:
32 jmp for.update
for.update:
33 mov #17(32), [rsp + 16]
34 mov #18(32), #17(32)
35 add #18(32), 1
36 mov [rsp + 16], #18(32)
37 jmp for.ph
for.after:
38 mov #19(32), [rsp + 12]
39 mov #20(32), #19(32)
40 jmp ret
#0i: [0, 2]
#1i: [1, 3]
#2i: [7, 9]
#3i: [8, 9]
#4i: [12, 16]
#5i: [13, 14]
#6i: [14, 15]
#7i: [15, 17]
#8i: [16, 18]
#9i: [18, 20]
#10i: [19, 20]
#11i: [23, 27]
#12i: [24, 25]
#13i: [25, 26]
#14i: [26, 28]
#15i: [27, 29]
#16i: [29, 30]
#17i: [33, 34]
#18i: [34, 36]
#19i: [38, 39]
#20i: [39, 39] = (rax)
(This output includes live ranges, which are computed a bit later as part of the register allocator)
The instruction selector is responsible for taking the IR and converting it into assembly instructions for the target platform. My instruction selector uses the most basic, obvious method, referred to in the literature as “macro expansion”. Each IR operation expands out into a template of assembly instructions. Each such expansion defines a virtual register which the result is available in afterwards. This virtual register is stored directly in the IR instruction, so later instructions referring to this instruction can simply use that virtual register to read the result.
There are a few ad-hoc exceptions, some of which are driven by instructions x86 doesn’t have, and some of which are just because I couldn’t stand how bad the generated code was. For instance, a simple condition like if (x < y) would be compiled into a comparison between x and y, followed by a comparison between the result and 0. The resulting code is just so redundant that I couldn’t stop myself from adding a special case to make it less awful, although really this transformation would be better done at the IR level.
The code we produce is very inefficient in other ways too. For instance, we make no assumptions about whether values are still in use after serving as input to a given instruction. So every two-address operation has to copy one of its inputs into a fresh register before combining the other input with it, despite the fact that most of the time values aren’t used more than once (especially since we don’t do any kind of redundancy elimination on the IR that might result in merging duplicate instructions together).
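As a sketch, the expansion for an IR add might look like this (the emit helpers are assumed; the unconditional mov is exactly the copy just described):

case OP_ADD: {
    // x86 add is two-address: "add dst, src" clobbers dst. We don't
    // know whether arg1 is still live, so copy it into a fresh
    // virtual register and add into that.
    u32 vreg = next_vreg(builder);
    emit_mov(builder, vreg_operand(vreg), asm_operand(instr->arg1));
    emit_add(builder, vreg_operand(vreg), asm_operand(instr->arg2));
    // Later instructions that refer to this IR instruction read the
    // result from this virtual register.
    instr->vreg = vreg;
    break;
}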
From later research, this style of instruction selection is basically the obvious first thing that everyone thinks of, and it was widely used in the ’60s before falling out of favour in the ’70s when better techniques were discovered. I think there are some better variants of what I’m doing which use some additional, lower-level IR and apply various peephole optimisations before producing the final code. But from what I understand the main paradigm these days is to treat the IR as a graph (with values/instructions as vertices and dependencies as edges), and write rules for how each assembly instruction covers parts of that graph. Then instruction selection consists of finding an optimal set of assembly instructions to cover (or “tile”) the graph (potentially with overlaps). This allows the compiler to use assembly instructions with multiple outputs, as they will just cover a larger section of the graph. It’s also more generic, as you can create these specifications for multiple architectures and then swap out the specification while re-using the machinery to find the covering. So, if I was to redo this I would probably rewrite the instruction selector completely to use one of these better techniques.
I’d also shift a few things into the IR generator to simplify this phase. In particular, parts of the calling convention. Currently the IR supports arbitrary types for function arguments and return values. But the calling convention (for System V x86-64 at least) can be complicated. Argument values can be spilt to the stack and passed in as pointers or packed into registers. Structs can be split apart into multiple register arguments, and return values can be converted into “out” pointer parameters that are passed in and written to. Some parts need to be handled in the backend, such as assignment to particular registers, but some of this would be much better handled in IR rather than while lowering call instructions.
The most annoying example is large arguments that must be spilled onto the stack. Copying a large value is a job for memcpy, but we only find out that we need to do the copy in the middle of generating the call sequence. At this point we may already be using the registers we’d use to call memcpy for earlier arguments in the call sequence we’re generating. I ended up working around this by generating a bad open-coding of memcpy inline instead, and it’s all very messy.
Converting these arguments into pointers up-front at the IR level would be much easier, plus it would make these copies visible to any IR-level transformations. If, for instance, we apply a transformation that reduces a struct to scalars, we may be able to forward those values directly into argument registers rather than producing a copy back onto the stack for the call sequence, which will then just be pulled apart again when lowering the call instruction. I’m unsure exactly where would be best to do the calling convention implementation – either up-front immediately when doing IR generation for a call, or later on as an IR pass. I can see arguments for either, but either would be better than doing it in the instruction selector.
Doing this kind of thing is a little unintuitive. One of the most touted advantages of an IR is target-independence, so that your frontends, optimisation passes, and backends are all interoperable. With this hammered in it feels weird to do target-dependent stuff at the IR level. But we’re not trying to produce something like Java bytecode, where the intermediate representation can be copied around to whatever platform you like and executed. It’s okay to bake ABI-specific details into the IR. These details are still baked in as target-independent operations and types which optimisation passes and instruction selection can handle generically, so we don’t lose the interoperability we’re after.
Instruction scheduling
The generated assembly from the instruction selector can be rearranged, as long as doing so doesn’t break any dependencies (i.e. moving a use of a register before a definition). Doing so can improve performance on modern pipelined processors. Just like IR transformations, this is a purely optional step that isn’t required for correctly compiling a program, and similarly I didn’t get around to it. This would also be nice to build at least a basic form of. It would be much lower priority, as the potential performance improvement is not as high, and it depends much more heavily on specific microarchitectural details.
Register allocation
$ ncc -dump-asm -c /tmp/max.c
section .text
global max
max:
push rbp
mov rbp, rsp
push r13
push r14
push r15
sub rsp, 24
entry:
mov r13, rdi
mov r14d, esi
mov [rsp], r13
mov [rsp + 8], r14d
mov [rsp + 12], dword -2147483648
mov [rsp + 16], dword 0
jmp for.ph
for.ph:
mov r13d, [rsp + 16]
mov r14d, [rsp + 8]
cmp r13d, r14d
jl for.body
jmp for.after
for.body:
mov r13, [rsp]
mov r14d, [rsp + 16]
mov r15d, r14d
imul r14, r15, 4
mov r15, r13
add r15, r14
mov r13d, [r15]
mov r14d, [rsp + 12]
cmp r13d, r14d
jg if.then
jmp if.after
if.then:
mov r13, [rsp]
mov r14d, [rsp + 16]
mov r15d, r14d
imul r14, r15, 4
mov r15, r13
add r15, r14
mov r13d, [r15]
mov [rsp + 12], r13d
jmp if.after
if.after:
jmp for.update
for.update:
mov r13d, [rsp + 16]
mov r14d, r13d
add r14d, 1
mov [rsp + 16], r14d
jmp for.ph
for.after:
mov r13d, [rsp + 12]
mov eax, r13d
jmp ret
ret:
add rsp, 24
pop r15
pop r14
pop r13
pop rbp
ret
section .data
section .bss
The instruction selector produces a form of assembly that uses the normal set of assembly instructions, but assumes an infinite supply of registers in the form of numbered “virtual registers”. The actual ISA only has a limited supply of registers of course, so we have to allocate a “physical register” to each virtual register. This has two main stages.
The first is computing a “liveness” for each virtual register. We need to know when the register is “alive” – after a definition and before a use. Two virtual registers which are never alive at the same time can be assigned the same physical register, as they won’t interfere. Since we use linear scan allocation (discussed below), we approximate the liveness for each register as a simple interval over the whole function.
Live range computation is the slowest part of the entire compiler. For most functions it’s fine, but there’s one particularly large generated function in the assembler for which live range computation takes a very long time. The problem is that we compute the live range for each virtual register separately, and each one requires examining the entire function. We have no explicit CFG at this point, so we can’t even precompute live-in and live-out sets for each basic block and we just iterate over every single instruction. The number of virtual registers is roughly proportional to the size of the function, so this is essentially O(n^2), and we waste lots of time scanning through tons of instructions that aren’t relevant for the virtual register we’re considering. Computing information for all virtual registers at once would be much more efficient, and having an explicit assembly-level CFG would help too.
The second stage is to use this information to allocate each virtual register a physical register. For this we use linear scan, which is a really nice and simple algorithm. There are various improvements that could be applied here such as coalescing, rematerialisation, improvements to spilling and more precise liveness tracking. We don’t do any of that, just very basic linear scan. We also spill very naively – if a register needs to be spilled we simply spill it completely, turning every read and write into stack accesses.
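The core of the algorithm is short. A condensed sketch, assuming the intervals are sorted by increasing start point (the helpers, and the choice of which interval to spill, are simplified):

void linear_scan(Interval *intervals, u32 count)
{
    Interval *active[NUM_PHYS_REGS]; // intervals currently assigned a register
    u32 num_active = 0;
    for (u32 i = 0; i < count; i++) {
        Interval *interval = &intervals[i];
        // Intervals that ended before this one starts return their
        // physical registers to the free set.
        num_active = expire_old_intervals(active, num_active, interval->start);
        if (num_active == NUM_PHYS_REGS) {
            // No register free: spill completely, turning every read
            // and write of this virtual register into a stack access.
            spill_completely(interval);
        } else {
            interval->phys_reg = take_free_register();
            active[num_active++] = interval;
        }
    }
}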
The main alternative to linear scan is to treat register allocation as a graph colouring problem. Rather than computing liveness as a simple range, you build a full “interference” graph, where every virtual register is a vertex, and an edge is added between two virtual registers if they cannot be assigned the same physical register (i.e. they do “interfere”). Now register allocation is reduced to the problem of finding a colouring of that graph, limiting the number of colours to the number of physical registers available. This allows you to express more precise relationships between registers. Control flow might jump over a region of the code, meaning that a virtual register isn’t live there even though it is live at an earlier and a later instruction. Also, whereas linear scan is inherently greedy, graph colouring approaches can be non-greedy, so there’s the potential for finding better solutions.
I’m not sure what approach I’d choose if I reworked this; I’d probably have to do a lot more research on the state of the art. Caveat on the following: this topic seems very complicated and so far I’ve only had the briefest look into the literature, so this may be completely confused and incorrect. That being said, from what I can tell, the actual colouring part of register allocation is considered pretty trivial, and other aspects such as splitting and spilling (in particular placing the spills well) are much more important and difficult to get right.
I think this and the other backend phases would benefit from a new IR in between the mid-level IR and raw assembly. It would use the same assembly instructions, but put all of them in an SSA form, where input and output registers are separated. We’d handle the final lowering by adding constraints when allocating registers, such that the output register must be allocated to one of the input registers. Or perhaps by expanding $1 = ADD $2, $3 into MOV $1, $2; ADD $1, $3, and then relying on register coalescing to eliminate the MOVs. Maintaining SSA form in the backend would simplify many algorithms, and since the input IR is in SSA form already it would be easy to generate.
In particular I think that SSA form assembly would make colouring completely trivial. It’s been shown that the interference graphs of SSA form programs are chordal, and therefore optimally colourable in polynomial time. Not only that, explicit construction of the interference graph isn’t necessary, as you can simply traverse the control flow graph in an order that respects dominance and it all just works out.
Register allocation seems to be a very crucial area for performance, so the design would warrant some serious thought.
Assembler
Even with how complex the x86-64 instruction encoding is, assembling is pretty straightforward. Once you know how a particular instruction is encoded it’s just a matter of putting all the bits in the right place. Parts of the encoding format are weird and complicated, but it’s essentially just business logic. The more interesting part is how you specify instruction encodings. Doing this directly in C would be very tedious, so again here I have a DSL. This DSL has one instruction format per line in almost the exact syntax used in the Intel architecture manual. The left hand side of each line specifies the mnemonic and operand types, and the right hand side specifies the encoding.
Here’s an example excerpt, which specifies all the different forms of ADD we support:
ADD r/m8, r8 = 00 /r
ADD r/m32, imm32 = 81 /0 id
ADD r/m32, r32 = 01 /r
ADD r/m64, imm8 = REX.W + 83 /0 ib
ADD r/m64, imm32 = REX.W + 81 /0 id
ADD r/m64, r64 = REX.W + 01 /r
This file is compiled into a giant C function which takes in a single instruction and compares it against every encoding format to find a match. Once it’s found a match it calls the (hand-written) instruction encoding function with the appropriate set of arguments. This makes adding new instructions very easy. This worked very nicely, and I’d do it the same way again.
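The generated function is conceptually just a long chain of checks, one per line of the DSL, something like this sketch (the helper names are invented):

// Generated from: ADD r/m64, imm32 = REX.W + 81 /0 id
if (instr->op == ADD
    && is_reg_or_mem(instr->args[0], 64)
    && is_imm(instr->args[1], 32)) {
    encode_instr(out, instr,
        /*rex_w=*/true,   // REX.W prefix
        /*opcode=*/0x81,
        /*reg_digit=*/0,  // the /0: an opcode extension in the ModRM reg field
        /*imm_bytes=*/4); // id: a 4-byte immediate
    return;
}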
In fact, I would lean into this even more heavily and include more data in the DSL. Firstly the list of assembly instructions is a separate enum which has to line up with those listed in the DSL, and this should be generated too. There are various instruction properties which are specified in big switches (such as whether it sign-extends its immediate, and which operands it reads/writes). These are easy to forget to update, so it would be nice to include them in the DSL as well. This would be especially the case if we want to port it to a different architecture. Then we’d likely also want to specify things like the set of registers each architecture has, how they overlap and what classes of value can go in each (this is how you end up with tablegen).
Producing ELF object files is also relatively straightforward. Once you understand the basic concepts of sections and symbols and relocations, it’s just a matter of filling in the structures according to the standard and writing them out.
Linker
The linker is deliberately kept quite simple. I didn’t intend on making it something that can handle arbitrary object files, linker scripts, shared object files, etc. It works as well as it needs to in order to handle the object files and static libraries we produce. If you feed it object files produced by another tool it will most likely spew incomprehensible error messages at you and exit. It would be nice if this wasn’t the case. Being able to compile the libc with a system compiler and link that against our object files would be handy for testing. As it is we work around this by building the compiler first, and compiling the libc with the newly built compiler. This means that compiler bugs often manifest in compiling the libc, which we have to do before we can even run our tests. These bugs are harder to diagnose than errors occurring in our nice, small, focused tests.
Fixing this would either mean making the linker complete enough to handle anything you can throw at it, or simply getting it working with one configuration and not caring if others can build it. This is actually what we used to do (in combination with passing certain flags to stop the system compiler from generating unnecessary sections which we can’t handle), but I switched to compiling with the newly built compiler when someone else tried building it and reported an issue.
I always treated required linker features as things to hack together as quickly as possible to reach the goal of self-hosting, so I don’t have a great idea of how hard it would be to make the linker more robust. It cuts a lot of corners, though, so it would certainly require making it much more general. For instance, right now we always produce the exact same fixed list of sections and segments in the output binary. This would certainly not work if we want to take arbitrary object files as input.
Overall I enjoyed the process of writing a linker and I’m glad I did, as it gave me a great understanding of how they work, and demystified what is quite a mysterious tool for many people. But I’m not very interested in developing it further, so if I did this again I’d probably just use the system linker and focus on parts of the toolchain I find more interesting.
libc
Like the linker, the libc is barebones and designed to be as simple as possible to support self-hosting. Most of the functions are pretty simple, forwarding to Linux syscalls without doing much work. Some, like printf and qsort, are a bit more complicated but easy enough. And some, like malloc, are a whole project in and of themselves, if you want to implement them well. I did not implement malloc well – my implementation is just a free list with individually mmap’ed entries. Not efficient at all, but it does the job.
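For the curious, that scheme fits in remarkably few lines. A sketch along those lines (not the literal code; error handling and alignment are omitted, and mmap and its flags come from sys/mman.h):

typedef struct FreeEntry
{
    size_t size; // usable size of this entry, excluding the header
    struct FreeEntry *next;
} FreeEntry;

static FreeEntry *free_list;

void *malloc(size_t size)
{
    // Reuse the first previously-freed entry that's big enough.
    for (FreeEntry **entry = &free_list; *entry != NULL; entry = &(*entry)->next) {
        if ((*entry)->size >= size) {
            FreeEntry *found = *entry;
            *entry = found->next;
            return found + 1;
        }
    }
    // Nothing suitable: mmap a fresh entry, header included.
    FreeEntry *header = mmap(NULL, sizeof *header + size,
        PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    header->size = size;
    return header + 1;
}

void free(void *ptr)
{
    if (ptr == NULL)
        return;
    // Push the entry back onto the free list for later reuse.
    FreeEntry *header = (FreeEntry *)ptr - 1;
    header->next = free_list;
    free_list = header;
}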
This falls into the same category as the linker for me: fairly easy and enlightening to write a basic version of, I’m glad I did it, but in the future I’d probably just use an existing one and spend my effort elsewhere.
Future plans
What next from here?
I’m interested in working on a more ergonomic and fully featured frontend (including compiling some large real-world applications), IR-level optimisations, and a more sophisticated backend. Optimisation in particular interests me the most. All of the other parts (linker, libc, textual assembler, ar clone) honestly feel like more of a distraction at this point. I don’t regret doing them, I learned a lot and it was very satisfying to have a complete, self-contained toolchain I’d written entirely myself. But I don’t feel any need to keep working on them. Many of the other parts could also do with a complete rewrite, as discussed in their individual sections above.
I’m also getting more and more tired of working in C. It’s a timeless and elegant language in its own way, but there’s not really any reason to use it for the compiler other than the appeal of self-hosting. The more I read about the parts of the compiler I’m interested in, the less well-suited C seems for the task. Having no built-in data structures and no ability to generically define them without resorting to lots of inconvenient hacks is a huge pain, and ultimately I feel I’d spend way too much time on these kinds of details rather than the stuff I actually want to learn about. Other languages have a lot of features I sorely miss when working on a large-scale software project like a compiler.
I think this all suggests starting a new project rather than continuing with this one. So that’s what I intend to do. Stay tuned.