This program assembles Hack, a simple assembly language from The Elements of Computing Systems (aka Nand to Tetris). Hack comprises of three kinds of instructions:
- A(ddress) instructions begin with
@and load a 15-bit memory address into the CPU’s A register
- C(ompute) instructions begin with either a destination or computation code and tell the CPU which registers to read from, what to do with those values, where to store the result, and whether to jump elsewhere in the code.
- L(abel) instructions are enclosed in parentheses. These are pseudocommands since they are used to label line numbers to enable jumping, but they are not themselves converted into assembly.
- the most significant bit</span> (set to 0) identifies the instruction as an A instruction
- the least significant 15 bits represent a 15-bit unsigned integer
C(omputation) Instructions have three component parts that are encoded into the 16-bit instruction string. For sample C-instructions like “AD=M+1” or “D;JEQ” or “D=1” or “D=M-1;JLT”:
- everything to the left of the
=is the destination part of the command – i.e., where the ALU’s output needs to go. (D is the destination in the command
D=M+1.) This piece of information gets encoded as a 3-bit sequence and spans bits 3-5 of the resulting instruction string.
- everything to the right of the
=and to the left of the
;is the computation command – i.e., what computation the ALU performs on its inputs. (M+1 is the computation in the command
D=M+1.) There are a lot of options here, so this piece of information gets encoded as a 7-bit sequence and spans bits 6-12 of the resulting instruction string.
- everything right of the
;is the jump command, which, if present, loads the value in the A Register (in this case representing a program line number) into the Program Counter, thus “jumping” to another line in the program. (JGT specifies the jump in the command
D;JGT.) This piece of information gets encoded as a 3-bit sequence and spans bits 0-2 in the resulting instruction string.
- the most significant bit (set to 1) identifies the instruction as a C Instruction
- bits 13 and 14 aren’t used
- the a bit (bit 12) dictates whether the ALU should accept its
yinput from the A Register or from somewhere in memory. This is effectively part of the computation instruction.
- the 6 c bits (bits 6-11) specify the ALU function and are fed into its 6 function inputs
- the 3 d bits (bits 3-5) specify the destination for the ALU’s output, either
A(A register), or
D(D register), or some combination of the three.
- the 3 j bits (bits 0-2) specify any jump command based on whether the ALU’s output is
>= 0, or
The binary encodings for each of these component parts are as follows:
Computation (bits 6-12)
Destination (bits 4-6)
|0||0||0||null||Don't store anywhere|
|0||1||1||MD||Memory[A], D Register|
|1||0||1||AM||A Register, Memory[A]|
|1||1||0||AD||A Register, D Register|
|1||1||1||AMD||A Register, Memory[A], D Register|
Jump (bits 1-3)
|0||0||1||JGT||Jump if out > 0|
|0||1||0||JEQ||Jump if out = 0|
|0||1||1||JGE||Jump if out >= 0|
|1||0||0||JLT||Jump if out < 0|
|1||0||1||JNE||Jump if out != 0|
|1||1||0||JLE||Jump if out <= 0|
L(abel) Instructions are pseudo-instructions, since they do not directly translate into equivalent lines of machine code. Instead, when L Instructions are encoundered, they are added to a symbols table as references to corresponding line numbers.
Two Pass Approach
This assembler performs two passes on an input .asm file.
The first pass identifies L Instructions and adds them to a symbols table with a corresponding line number, so that jump instructions specifying that label know to which line of machine code to jump.
The second pass handles A and C instructions. When symbols are encountered, these are added to the symbols table with an arbitrarily assigned memory address such that each symbol corresponds to a unique address.
An assembler must maintain a list of symbols, both for A Instructions and L Instructions.
During the first pass, if the assembler encountered an L Instruction – e.g.,
(MY_LABEL) – it will insert this label into the symbols table as a key with the corresponding line number as its matching value. When the assembler encounters a label on its second pass, it need only look up
MY_LABEL in the symbols table to access its matching value (which, in the case of labels, corresponds to a line number for use in jump instructions).
A similar process occurs when the assembler encounters symbols in A Instructions. For instance, if the assembler reads in a line like
@my_var, it will attempt to look that symbol up in a symbols table to find the corresponding value (in this case, a memory location). If
my_var does not exist in the symbols table, then it is arbitrarily assigned an unused address location.
To handle lookups, this assembler implementation uses a linked list, whose nodes have key/value pair members. When the assembler encounters a label on its first pass, it appends that label and corresponding line number to the symbols linked list (a constant time operation, since the linked list struct holds a reference to both head and tail nodes). When the assembler encounters labels or symbols on its second pass, it performs a linear time lookup. If found, the matching value is returned, otherwise a new node is appended to the end of the linked list.
A proper hash table with constant time lookups would be more performant – I plan to implement this functionality in a future version.
C Instruction Lookups
The assembler must also be able to convert C Instruction tokens to their corresponding integer codes. For instance, destination command
MD corresponds to
0b011), and the computation commany
A+1 corresponds to
0b0110111). To handle lookups for destination, jump, and computation instrutions, this assmbler implementation utilizes parallel arrays to hold key/value pairs. In practice, this involves linear time lookups, however the performance hit is nominal since these arrays are small and constant size.
Before using the assembler, compile it using the included
makefile. From the root directory, run
make, which will compile the project and all its dependencies, producing a variety of object files and an executable named
Once the assembler has been compiled, you can use it to translate your assembly code into machine code. The assembler takes one command line argument, which is the path to the assembly file you want to translate. The command follows this format:
path/to/your/assembly_file.asm with the actual path to your assembly file.
After running the above command, the assembler will process the assembly file and generate the corresponding machine code. In addition to being printed to stdout, the output will be saved to a new file with the same name as the input assembly file, but with a different extension (e.g., .hack). You can find this output file in the same directory where your assembly file is located.
To remove executable and all dependent object files, run
Here’s an example of how you could use the assembler to translate an assembly file named “example.asm”:
After running this command, you will find the assembled machine code in a file named “example.hack” in the same directory.
> ./assembler path/to/Rect.asm 0: @0 --> 0000000000000000 1: D=M --> 1111110000010000 2: @INFINITE_LOOP --> 0000000000010111 3: D;JLE --> 1110001100000110 4: @counter --> 0000000000010000 5: M=D --> 1110001100001000 6: @SCREEN --> 0100000000000000 7: D=A --> 1110110000010000 8: @address --> 0000000000010001 9: M=D --> 1110001100001000 10: (LOOP) --> 10: @address --> 0000000000010001 11: A=M --> 1111110000100000 12: M=-1 --> 1110111010001000 13: @address --> 0000000000010001 14: D=M --> 1111110000010000 15: @32 --> 0000000000100000 16: D=D+A --> 1110000010010000 17: @address --> 0000000000010001 18: M=D --> 1110001100001000 19: @counter --> 0000000000010000 20: MD=M-1 --> 1111110010011000 21: @LOOP --> 0000000000001010 22: D;JGT --> 1110001100000001 23: (INFINITE_LOOP) --> 23: @INFINITE_LOOP --> 0000000000010111 24: 0;JMP --> 1110101010000111 Output file: path/to/Rect.hack Took 0.001327 seconds
This assembler is a C port of the Python assembler I built during my Summer 2023 batch at the Recurse Center while completing The Elements of Computing Systems.