A C language compiler implemented in a few hundred lines of code!
This article is intended to eliminate the "fear" of friends about the compilation principle: it is not so high and out of reach!
Today’s protagonist is a C language compiler- C4 (GitHub Fork: 1.4K, Star: 8.6K) written by a foreign expert .
C4: C language compiler implemented with 4 functions
C4, C in four functions.
It is a C language compiler project (the address is at the end of the article, or click "Read the original text" to jump directly). The entire implementation only has:
-
• A C language source code file
-
• 528 lines of C code
-
• 4 functions
That's all.
It's concise, but not simple.
It has simple and complete lexical analysis, syntax analysis, semantic checking, code generation, and runtime environment (i.e. virtual machine).
Different from common C compilers, it compiles C language source programs into bytecode and then executes them in a streamlined virtual machine.
Do you think this is over? No! It’s so much more amazing than that!
C4: C language compiler that can be bootstrapped
If it were just streamlined, maybe it wouldn't be so surprising. After all, there are many similar compiler projects on the Internet, many of which are very simple, elegant, and outstanding projects.
However, the most amazing thing about C4 is that although it only has more than 500 lines of code, it achieves bootstrapping .
The so-called bootstrapping, simply put, means compiling yourself. This is usually one of the criteria used to measure whether a compiler is fully functional.
Next, we demonstrate the "Hello, World!" example and the C4 bootstrapping example.
Hello, World example
First use GCC to compile C4 into an executable file:
gcc c4.c -o c4
Compile and run the test program hello.c:
The result is as shown below:
Among them, "hello, world" is printed when hello.c is executed, "exit(0) cycle = 9" is printed by the c4 compiler, where "exit(0)" means that the program is executed normally, and "cycle = 9" means After c4 compiles hello.c, a total of 9 bytecodes are generated.
C4 Bootstrap Example
We use GCC to compile c4.c to generate the executable file c4 , which we call compiler A. Then we use A to compile and execute the source code c4.c of c4, generate a compiler B , and then use B to compile and execute hello.c .
./c4 c4.c hello.c
The result is as shown below:
It can also be like this:
./c4 c4.c c4.c hello.c
The c4 generated by GCC compilation is compiler A , "./c4 c4.c" generates compiler B , "./c4 c4.c c4.c" generates compiler C , and finally compiles and runs hello.c with compiler C. .
continue:
In theory, the recursion can continue forever. However, as can be seen from the figure, the deeper the level of recursion, the more bytecode is generated and the more time it takes to execute.
C language subset implemented by C4
C4 is committed to implementing a bootstrapping C compiler with the least amount of code. It consists of only 528 lines of code and 4 functions. It is conceivable that it is impossible to completely implement all the features defined by the C language standard. It only implements the minimum subset of the C language required for bootstrapping .
type of data
-
• char
-
• int
-
• Pointer
-
• enum (enum)
-
• Array
-
• String
Data types such as struct, typedef, and union are not supported.
Statement structure
-
• if-else control statement
-
• while loop statement
-
• return statement
-
• Function
Statement structures such as do-while, switch-case, for, continue, break, goto, etc. are not supported.
operator
It supports almost all operators except +=, %=, <<=, &= and other conforming operators. include:
-
• Arithmetic operators
-
• Relational operators
-
• Logical Operators
-
• Bit operators
-
• Assignment operator
-
• Miscellaneous operators (such as the ternary operator?:)
Built-in library functions
The C4 compiler uses some system library functions when implementing it. Therefore, in order to implement bootstrapping, it also has built-in support for several library functions. Including: open, read, close, printf, malloc, free, memset, memcmp, exit
It should be noted that it does not support preprocessing commands starting with #, such as #include, #define, #if, etc.
Code comments only support single-line comments starting with "//" and do not support multi-line comments marked with "/* */".
Compared with traditional C language compilers, C4 has its own unique features in implementation.
Next, let’s briefly introduce the implementation process of traditional compilers.
Typical compiler implementation
A typical compiler implementation generally has the following steps:
-
• lexical analysis
-
• Gramma analysis
-
• Semantic Analysis
-
• Intermediate code generation
-
• Code optimization
-
• Machine code generation
This is true for GCC, Clang, Go, etc.
At each stage, the code is scanned one or more times. The "code" here includes source code, syntax tree, intermediate code and other representations in text form.
In a typical implementation, lexical analysis and syntax analysis are usually combined (of course they can be separated). During syntax analysis, the lexical analyzer is called to obtain tokens one by one. Therefore, in theory, the lexical analysis and syntax analysis stages only need to scan the source code once (of course there are exceptions) and generate a syntax tree, or Abstract Syntax Tree (AST).
In the semantic analysis stage, the main object of the operation is this tree, and this tree must be scanned at least once. In some implementations, intermediate code is directly generated while performing semantic checking.
In the code optimization phase, depending on the intensity of compiler optimization, the intermediate code may be scanned multiple times.
The so-called "scan once" here is generally called pass in compiler terminology. Friends who know LLVM should know that each type of optimization in LLVM is a pass. To apply multiple optimization techniques, multiple passes are required.
What makes C4 unique?
As a C4 compiler that pursues minimalism, it has many unique features in its implementation.
Compile C into bytecode and execute it in the virtual machine
Traditional C language compilers eventually compile C language source code into executable files (such as ELF, PE, etc.), which are binary machine codes.
C4, on the other hand, first compiles the C language source code into its specially designed bytecode, and then executes it directly in the virtual machine, similar to Java, Lua, etc.
C4 designed 39 bytecode instructions, most of which are similar to assembly instructions, mainly memory access instructions, arithmetic operation instructions, etc. Among them, in order to support the built-in library functions, 9 special library function calling instructions are specially designed.
Its virtual machine is a typical stack virtual machine (early Lua was also a stack virtual machine, but the latest Lua 5.x uses a register virtual machine), and its implementation is also very streamlined.
In addition, it also supports simple debugging functions. For example, you can use the -d command to dump the generated bytecode. The picture below is the bytecode of hello.c:
Only scan the source code once
Different from traditional compiler implementation, C4 cleverly combines the steps of lexical analysis, syntax analysis, semantic analysis, and code generation. In the entire process of compiling C language source code into bytecode, it only scans it once. Source code .
Lua's interpreter also scans the source code once, so the performance of C4 and Lua is quite good.
Readability of C4 source code
When C4 was released that year, it caused quite a stir in some well-known communities on the Internet and triggered a lot of discussions. Regarding the implementation of C4, some people think that the implementation of C4 is very concise and easy to read, while others think that the implementation of C4 is slightly obscure.
In my personal opinion, the implementation of C4 is indeed very simple. After all, there are only 4 functions and more than 500 lines of code. But to truly fully understand, you need to have a certain basic knowledge of compilation principles, which is also the only threshold for understanding C4.
For example , the syntax analysis process of C4 is a typical implementation method that combines recursive descent and operator precedence algorithm . As long as you understand the classic algorithm principles of these compilers, it is relatively easy to understand the implementation logic of C4.
In addition, in order to pursue bootstrapping with the least amount of code, C4 uses some techniques in implementation, which has a certain impact on the readability of the code.
For example, we mentioned earlier that C4 does not support the struct type. This also means that the struct type cannot be used in C4 source code. To do this, it chooses to use arrays to simulate struct structures. At first glance, this may cause some confusion.
Personally, I think that one of the shortcomings of C4 is that the naming of its variables is too concise. For example, the variable that marks the location of the bytecode is represented by "e". In fact, if "emit" is used, it may be much clearer and easier to understand.
In addition to the native implementation of C4, various masters on the Internet have also developed many derivative implementations based on C4. For example, someone added more than 80 additional lines of code to C4, but made C4 support JIT, and the execution speed was greatly improved. It was verified on my machine that the execution speed increased by 27 times! Some people have also made simple extensions to C4 so that it can directly generate real machine code and finally generate ELF executable files.
summary
Modern compiler projects, such as GCC and Clang/LLVM, have very complex implementation logic and huge code sizes, making it almost impossible for one person to fully understand them. Even the streamlined implementations of TCC and LCC, which are highly recommended on the Internet, are relatively complicated to implement, and it is difficult for novices to thoroughly study them.
Personally, I think C4 is one of the compiler projects that is very suitable for novices to study. After all, there are only four functions, more than 500 lines of source code, and there are no particularly complex and difficult algorithms in it. Of course, before studying, it is best to have some basic knowledge of compilers, so that it will be easier to understand.
Although C4 is concise, it cannot be explained completely clearly in one article. Therefore, this article only briefly introduces C4 and does not discuss too much about the specific implementation of C4. Interested friends can find it on github, study it carefully, or send me a private message. If there are many interested people, I can update a few articles to explain its code implementation in detail.
C4 project address
https://github.com/rswier/c4
Recently, many friends have asked me for some essential information for programmers, so I dug out the treasures at the bottom of the box and shared them with everyone for free!
Scan the QR code of the poster to get it for free.