“The compiler is giving so many errors, life is hell, arggghh!”
“Oh, it was a missing semicolon, never mind.”
We’ve all been there: wading through a wall of compile-time errors, only to realize the cause was the smallest issue.
The compiler has always been a black box for us. We press the compile or run button, and an output or a bunch of errors appear on the screen.
Today, I am going to explain the key steps that happen behind the scenes. This is gonna be a good one!
What is a compiler?
The compiler is software that takes our code in a high-level programming language and translates it into machine-readable code so that the hardware can understand it.
The translation from a high-level programming language to machine code consists of a lot of steps.
We will be using this C++ program saved in source.cpp as a reference throughout this article:
// source.cpp
#include <iostream>
using namespace std;
int main() {
    int a = 0, b = 15;
    // sum a and b
    int x = a+b;
    cout<<x;
    return 0;
}
It is adding two integers, a and b, storing the sum in x, and printing x using cout from the iostream header.
These are the steps in the compilation process:
Technically, the compilation process ends with the generation of machine code (object files). Steps like linking, loading, and execution belong to the broader program lifecycle, which encompasses creating and running an executable.
1. Preprocessing the source file
The preprocessor fetches every file that the source file pulls in with #include and copies its contents into the file in place of the #include directive.
So, after preprocessing, the code contains the implementation from iostream (functions, objects, and definitions), along with the original code.
Macros defined using #define are expanded in the code, and comments are stripped out (each one is replaced by a single space).
So, currently, I have my source.cpp file. In the same directory, I run:
g++ -E source.cpp
The -E flag tells the compiler to only preprocess the source file and output the result to the terminal (standard output) without compiling it further.
So, at this stage, the file looks like the following:
Notice that in the end, #include is removed, and all the contents of iostream are copied above the code. Comments are also removed.
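To make macro expansion a bit more concrete, here is a minimal sketch. The macro SQUARE and the function squareOfSum are made up for illustration and are not part of source.cpp.

```cpp
// Hypothetical macro, not part of source.cpp.
#define SQUARE(x) ((x) * (x))

// What we write:               return SQUARE(a + b);
// What the preprocessor emits: return ((a + b) * (a + b));
int squareOfSum(int a, int b) {
    return SQUARE(a + b);
}
```

Running g++ -E on a file containing this function should show the expanded expression in place of SQUARE(a + b). The extra parentheses in the macro body keep a + b grouped, so squareOfSum(0, 15) evaluates to 225.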
2. Lexical analysis
After preprocessing, the compiler scans the source code and breaks it into tokens. Tokens are the smallest units of the program, such as keywords, identifiers, operators, and punctuation.
The lexical analyzer will then break the code into the following tokens:
using, namespace, std (namespace declaration)
int, main, (, ), {, } (function declaration)
int, a, =, 0, ,, b, =, 15, ; (variable declarations and initialization)
int, x, =, a, +, b, ; (declaration of variable x, initialization, and addition operation)
cout, <<, x, ; (output stream and operation)
return, 0, ; (return statement)
The lexical analyzer, also known as the scanner, outputs these tokens as a stream, which is passed to the next phase.
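Here is a toy scanner to show the idea. It is a sketch that only understands identifiers, integer literals, the << operator, and single-character punctuation, which is far less than a real C++ lexer handles. The function name tokenize is an assumption for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits a line of C++-like source into tokens. A toy scanner, not a real one.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalpha(c) || c == '_') {
            // identifier or keyword: letters, digits, underscores
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else if (std::isdigit(c)) {
            // integer literal: a run of digits
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) {
                ++i;
            }
            tokens.push_back(src.substr(start, i - start));
        } else if (c == '<' && i + 1 < src.size() && src[i + 1] == '<') {
            tokens.push_back("<<");  // two-character operator
            i += 2;
        } else {
            tokens.push_back(std::string(1, src[i]));  // single-character token
            ++i;
        }
    }
    return tokens;
}
```

Calling tokenize("int x = a+b;") yields the tokens int, x, =, a, +, b, and ;, matching the list above.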
I used the following command to view tokens generated for my code in source.cpp after preprocessing:
clang -Xclang -dump-tokens source.cpp
3. Parsing
This step checks the syntax of the code to make sure that it follows the rules of the programming language.
Parsing takes the stream of tokens produced during lexical analysis and organizes them into a tree-like structure called an Abstract Syntax Tree (AST) or Parse Tree.
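As a rough sketch of how tokens become a tree, here is a tiny recursive-descent parser for sums like a + b. The grammar, the Node struct, and the Parser class are simplifications invented for this example; a real C++ parser is vastly larger.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One AST node: a leaf (identifier or number) or a "+" node with two children.
struct Node {
    std::string text;
    std::unique_ptr<Node> left, right;
};

// Parses the toy grammar:  sum := operand ('+' operand)*
class Parser {
public:
    explicit Parser(std::vector<std::string> tokens) : tokens_(std::move(tokens)) {}

    std::unique_ptr<Node> parseSum() {
        auto lhs = parseOperand();
        while (pos_ < tokens_.size() && tokens_[pos_] == "+") {
            ++pos_;  // consume '+'
            auto plus = std::make_unique<Node>();
            plus->text = "+";
            plus->left = std::move(lhs);
            plus->right = parseOperand();
            lhs = std::move(plus);  // left-associative: (a + b) + c
        }
        if (pos_ != tokens_.size()) {
            throw std::runtime_error("unexpected token: " + tokens_[pos_]);
        }
        return lhs;
    }

private:
    std::unique_ptr<Node> parseOperand() {
        if (pos_ >= tokens_.size() ||
            !std::isalnum(static_cast<unsigned char>(tokens_[pos_][0]))) {
            throw std::runtime_error("expected an identifier or number");
        }
        auto leaf = std::make_unique<Node>();
        leaf->text = tokens_[pos_++];
        return leaf;
    }

    std::vector<std::string> tokens_;
    std::size_t pos_ = 0;
};

// Renders the tree in prefix form, e.g. "(+ a b)".
std::string toString(const Node& n) {
    if (!n.left) return n.text;
    return "(" + n.text + " " + toString(*n.left) + " " + toString(*n.right) + ")";
}
```

Feeding it the tokens a, +, b produces a tree that prints as (+ a b), while an incomplete input such as a, + throws, which is the toy equivalent of a syntax error.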
So, for our code, the parse tree would mostly look like:
4. Semantic analysis
Semantic analysis is the phase where the compiler checks the meaning of the code after it has passed the parsing phase. It ensures that the program is semantically correct by applying rules of the programming language, such as type checking, scope resolution, and function usage.
Type Checking: Ensures operations are performed on compatible types (e.g., int x = a + b; checks if a and b are integers).
Scope Resolution: Verifies variables and functions are declared before use (e.g., cout << x; checks if x exists in scope).
Function and Return Validation: Confirms return types match function declarations (e.g., int main() must return an integer).
Operator Validation: Ensures operators are used with valid operands (e.g., + works with integers, << works with cout).
Output: An annotated syntax tree (AST) with additional details like types, scopes, and resolved references.
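Two of these checks, scope resolution and type checking, can be sketched with a small symbol table. The SymbolTable class and checkAdd function below are hypothetical names for this illustration, and types are stored as plain strings to keep things short.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// A toy symbol table: a stack of scopes, each mapping a name to its type.
class SymbolTable {
public:
    SymbolTable() { enterScope(); }  // start with a global scope

    void enterScope() { scopes_.emplace_back(); }
    void exitScope() { scopes_.pop_back(); }

    // Records a declaration such as `int x` in the innermost scope.
    void declare(const std::string& name, const std::string& type) {
        if (!scopes_.back().emplace(name, type).second) {
            throw std::runtime_error("redeclaration of '" + name + "'");
        }
    }

    // Scope resolution: search from the innermost scope outwards.
    const std::string& typeOf(const std::string& name) const {
        for (auto it = scopes_.rbegin(); it != scopes_.rend(); ++it) {
            auto found = it->find(name);
            if (found != it->end()) return found->second;
        }
        throw std::runtime_error("'" + name + "' was not declared in this scope");
    }

private:
    std::vector<std::map<std::string, std::string>> scopes_;
};

// Type check for `lhs + rhs`: in this toy version both operands must be int.
std::string checkAdd(const SymbolTable& table,
                     const std::string& lhs, const std::string& rhs) {
    if (table.typeOf(lhs) != "int" || table.typeOf(rhs) != "int") {
        throw std::runtime_error("operands of + must both be int");
    }
    return "int";  // the type of the whole expression
}
```

After declaring a and b as int, checkAdd(table, "a", "b") returns "int". Looking up an undeclared name throws, much like the "was not declared in this scope" errors a real compiler reports.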
5. Intermediate code generation
In this step, the compiler translates the annotated syntax tree (AST) into an intermediate representation (IR). The IR is a simplified, low-level, platform-independent code that is easier to optimize and translate into machine code.
For our example, the IR might look like this in a three-address code (TAC) format:
main:
a = 0
b = 15
t1 = a + b
x = t1
call print, x
return 0
TAC is an intermediate representation where each instruction has at most three operands. The IR varies by compiler. Some use three-address code, while others use representations like LLVM IR or abstract syntax trees.
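As a rough illustration of how three-address code could be produced from an expression tree, here is a minimal sketch. The Expr struct and the helpers leaf, makeAdd, and emit are assumptions made for this example, not how any particular compiler does it.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Expression tree node: an operand name, or "+" with two children.
struct Expr {
    std::string text;
    std::unique_ptr<Expr> left, right;
};

std::unique_ptr<Expr> leaf(const std::string& name) {
    auto e = std::make_unique<Expr>();
    e->text = name;
    return e;
}

std::unique_ptr<Expr> makeAdd(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->text = "+";
    e->left = std::move(l);
    e->right = std::move(r);
    return e;
}

// Appends TAC instructions for `e` to `code` and returns the name
// (variable or temporary) that holds the value of `e`.
std::string emit(const Expr& e, std::vector<std::string>& code, int& tempCount) {
    if (!e.left) return e.text;  // plain operands need no instruction
    std::string l = emit(*e.left, code, tempCount);
    std::string r = emit(*e.right, code, tempCount);
    std::string temp = "t" + std::to_string(++tempCount);
    code.push_back(temp + " = " + l + " " + e.text + " " + r);
    return temp;
}
```

Emitting code for the tree of a + b adds the single instruction t1 = a + b and returns t1, which the assignment to x can then use.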
LLVM IR is closer to assembly but still abstract enough for portability and optimization. For x = a + b;, LLVM IR might look like this:
%1 = load i32, i32* %a ; Load value of 'a'
%2 = load i32, i32* %b ; Load value of 'b'
%3 = add i32 %1, %2 ; Add 'a' and 'b'
store i32 %3, i32* %x ; Store result in 'x'
Let’s see the LLVM IR in the terminal for our code using the command:
clang -S -emit-llvm source.cpp -o source_unopt.ll
This will dump the IR into the source_unopt.ll file.
6. Optimization of IR
We are halfway through, phew!
Once the Intermediate Representation (IR) is generated, the compiler performs optimizations to make the code more efficient. These optimizations can improve execution speed, reduce memory usage, or minimize the size of the generated machine code.
Many optimizations happen at the IR level so that they stay independent of the target platform and can be reused across architectures.
This step computes constant expressions at compile time, removes unreachable or unused code, and eliminates repeated computations of the same value. It also replaces expensive operations with cheaper ones.
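Constant folding, the first of these, can be sketched on a small expression tree. The Expr struct and the helpers below are hypothetical and handle only non-negative integer constants and the + operator.

```cpp
#include <algorithm>
#include <cctype>
#include <memory>
#include <string>
#include <utility>

// Expression tree node: an operand, or "+" with two children.
struct Expr {
    std::string text;
    std::unique_ptr<Expr> left, right;
};

std::unique_ptr<Expr> leaf(const std::string& text) {
    auto e = std::make_unique<Expr>();
    e->text = text;
    return e;
}

std::unique_ptr<Expr> makeAdd(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r) {
    auto e = std::make_unique<Expr>();
    e->text = "+";
    e->left = std::move(l);
    e->right = std::move(r);
    return e;
}

// True for a leaf made only of digits, e.g. "15".
bool isConstant(const Expr& e) {
    return !e.left && !e.text.empty() &&
           std::all_of(e.text.begin(), e.text.end(),
                       [](unsigned char c) { return std::isdigit(c) != 0; });
}

// Folds bottom-up: any "+" whose operands are both constants
// is replaced by a single constant leaf computed at compile time.
void fold(std::unique_ptr<Expr>& e) {
    if (!e->left) return;  // leaves are already as simple as they get
    fold(e->left);
    fold(e->right);
    if (e->text == "+" && isConstant(*e->left) && isConstant(*e->right)) {
        int sum = std::stoi(e->left->text) + std::stoi(e->right->text);
        e = leaf(std::to_string(sum));
    }
}
```

Folding the tree for 0 + 15 leaves a single leaf 15, while a + (2 + 3) becomes a + 5 because a is not known at compile time in this sketch.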
Optimized IR after computing constant expressions and dead code elimination:
main:
b = 15
x = b
call print, x
return 0
Let’s see the optimized IR in the terminal, using:
clang -O2 -S -emit-llvm source.cpp -o source_opt.ll
This command will apply the -O2 optimization level and output the optimized LLVM IR to source_opt.ll.
Notice how much more concise it is than the unoptimized version from the previous step.
7. Assembly code generation
After Intermediate Code Generation and Optimization, the next step in the compilation process is Assembly Code Generation.
In this step, the intermediate representation (IR) is translated into a low-level assembly code specific to the target machine architecture. The assembly code is still human-readable but is closely tied to the specific instructions of the CPU.
To get the assembly code for my current program, I run the following in my terminal:
clang -S source.cpp -o source.s
This command saves the assembly code to the source.s file.
The following is the output I got. Your output might differ, based on the architecture of your device.
8. Machine code
Machine code is the final output of the compilation process and is composed of binary instructions that are directly executable by the computer’s CPU. These instructions correspond to the specific operations that the hardware can perform.
Assembly code and machine code are both tied to a specific CPU architecture, but assembly is human-readable text, while machine code is in binary format (0s and 1s) and is the only form that the CPU can execute natively.
Machine code is typically stored in object files (like .o or .obj), which contain the compiled binary representation of the program. These are produced when the assembler translates the assembly code from the previous step.
9. Linking
After compiling and generating object files (e.g., source.o), the linker takes these object files and combines them to create the final executable program.
If your program uses external libraries (like cout from the C++ standard library), the linker resolves the references and links those libraries to your program.
In addition, the linker may combine multiple object files and ensure that external functions or variables (like those from the standard library) are properly linked to your program’s code.
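The split between a declaration and a definition is what gives the linker something to resolve. The sketch below keeps both in one file so it compiles on its own; the names add and callAdd are made up for illustration.

```cpp
// Declaration only: tells the compiler that add exists somewhere.
int add(int a, int b);

// This compiles with just the declaration in view. In the object file,
// the call to add is recorded as a reference to be resolved later.
int callAdd() {
    return add(0, 15);
}

// Definition: in a larger project this could live in a different .cpp file,
// and the linker would connect the call above to this code.
int add(int a, int b) {
    return a + b;
}
```

If add were moved to its own source file and each file were compiled separately with g++ -c, the caller's object file would hold an unresolved reference to add, and the linker would fill it in when combining the object files.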
When you compile a program without specifying an output file name, the linker on Unix-like systems generates a default executable file named a.out.
To get the executable, run the following command in the terminal:
g++ source.cpp
This will generate an a.out file in the directory.
10. Loading
After linking, the loader takes the final executable and loads it into the system’s memory to prepare it for execution. It does the following:
Loads the program into the appropriate memory locations (typically in the RAM).
Resolves addresses for variables, functions, and other components.
Sets up the stack, heap, and other memory regions needed for execution.
Loads any dynamic libraries or shared objects that the program depends on (e.g., libstdc++ for C++ programs).
11. Execution
Finally, once the program is loaded into memory, the operating system hands control over to the program’s entry point. In a C++ program, runtime startup code runs first and then calls the main function. The program executes its instructions in sequence, beginning with main and progressing through the program's logic, until it terminates (either successfully or with an error).
During execution, if any errors are encountered (like runtime errors), the program might crash or throw exceptions, depending on how errors are handled in the code.
Once the program has completed its tasks (e.g., printing output, performing calculations), it exits, and the process terminates.
Conclusion
We went through all the steps involved in the compilation process and hopefully made the black box a little bit more transparent.
Back when I wanted to learn about compilers, this video really helped, and this channel is also an amazing resource: https://www.youtube.com/@frameofessence.
By understanding these steps, we can better debug, optimize, and appreciate the work done behind the scenes in making our code run.
We’re so grateful to the author for allowing us to share her story here on Code Like A Girl. You can find her original post on Medium here. If you enjoyed this piece, we encourage you to visit her publication and subscribe to support her work!