

Cool Assembly Language

Cool Assembly Language is a simplified RISC-style assembly language that is reminiscent of MIPS Assembly Language.

A Cool Assembly Language program is a list of instructions. Each instruction may be preceded by any number of labels. Comments follow the standard Cool conventions. In addition, a semicolon ; functions like a double dash -- in that it marks the rest of that line as a comment. The Cool CPU is a load-store architecture with eight general purpose registers and three special-purpose registers. For simplicity, a machine word can hold either a 32-bit integer value or an entire raw string; regardless, all machine words have size one.

This document assumes that you already have some familiarity with assembly language, registers, and how CPUs operate. We first present a formal grammar and then explain the semantics. Only terms in typewriter font are part of the formal grammar. Text after — is a comment. We use italics for non-terminals.

register ::= r0 — general purpose register #0, often used as the accumulator
register ::= r1 — general purpose register #1
register ::= r2
register ::= r3
register ::= r4
register ::= r5
register ::= r6
register ::= r7
register ::= sp — stack pointer register
register ::= fp — frame pointer register
register ::= ra — return address register

instruction ::= li register <- integer — load immediate
instruction ::= mov register <- register — register-to-register copy
instruction ::= add register <- register register
instruction ::= sub register <- register register
instruction ::= mul register <- register register
instruction ::= div register <- register register

instruction ::= jmp label — unconditional branch
instruction ::= bz register label — branch if equal to zero
instruction ::= bnz register label — branch if not zero
instruction ::= beq register register label — branch if equal
instruction ::= blt register register label — branch if less than
instruction ::= ble register register label — branch if less than or equal to
instruction ::= call label — direct function call
instruction ::= call register — register-indirect function call
instruction ::= return — function return

instruction ::= push register — push a value on the stack
instruction ::= pop register — pop a value off the stack
instruction ::= ld register <- register [ integer ] — load a value from memory
instruction ::= st register [ integer ] <- register — store a value into memory
instruction ::= la register <- label — load an address into a register

instruction ::= alloc register register — allocate memory
instruction ::= constant integer — lay out a compile-time constant in memory
instruction ::= constant raw_string — lay out a compile-time constant in memory
instruction ::= constant label — lay out a compile-time constant in memory

instruction ::= syscall name — request a service from the run-time system

instruction ::= debug register — debugging support: print register value
instruction ::= trace — toggle tracing

That's it, and the last two do not really count. We next describe the interpretation of these instructions in more detail.

The system calls available are:

These system calls correspond directly to internal predefined methods on Cool Int and String objects. The key difference is that the system calls work on raw values (i.e., machine-level ints and strings) and not on Cool Objects.


Cool CPU Simulator

The normal Cool compiler executable (e.g., cool.exe) also serves as a Cool CPU Simulator that executes Cool Assembly Language programs. Just pass file.cl-asm as an argument.

The simulator performs the following actions:

  1. Load the .cl-asm program into memory starting at address 1000. That is, if the first instruction in file.cl-asm is mov r1 <- r2, then memory location 1000 will hold the instruction mov r1 <- r2. If the second instruction in file.cl-asm is constant 55, then memory location 1001 will hold the integer 55.
  2. Set sp and fp to 2,000,000,000. Remember, the stack starts at high addresses and grows down.
  3. Search file.cl-asm for a label named start. The program counter is set to the address corresponding to that label. For example, if start: occurs just before the third instruction in file.cl-asm, then the program counter starts at 1002.
  4. Fetch the instruction pointed to by the program counter and execute it. Unless the instruction specifies otherwise, the program counter is incremented by one and the process repeats.

  5. When memory is allocated (e.g., by the alloc instruction), addresses starting from at least 20,000 are used.
  6. If more than 1000 call instructions are executed before any return instructions are executed (i.e., if there are more than 1000 calls on the stack), the simulator terminates and prints a stack overflow error.
The constant values listed above (1000; 20,000; 2,000,000,000) should not be counted on by your program, but are listed here to help with debugging. Addresses near 1000 hold program instructions or compile-time data (i.e., the code segment), addresses near 20,000 hold the heap, and addresses near two billion are on the stack.
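
The load, start-label lookup, and fetch-execute steps above can be sketched in a few lines of Python. This is only an illustration, not the simulator's actual implementation: the (labels, instruction) tuple format, the function names, the zero-initialized registers, and the rule of stopping when the program counter leaves the loaded code are all assumptions of the sketch, and only a handful of instructions are modeled.

```python
# Hypothetical sketch: the (labels, instruction) input format and all names are assumptions.
CODE_BASE = 1000            # documented load address (programs should not rely on it)
STACK_TOP = 2_000_000_000   # documented initial value of sp and fp

def load(program):
    """Place one entry per address starting at CODE_BASE and record label addresses."""
    memory, labels = {}, {}
    for offset, (names, instr) in enumerate(program):
        address = CODE_BASE + offset
        for name in names:
            labels[name] = address
        # A constant occupies one word that holds the value itself.
        memory[address] = instr[1] if instr[0] == "constant" else instr
    return memory, labels

def run(program, max_steps=10_000):
    """Fetch and execute until the program counter leaves the loaded code."""
    memory, labels = load(program)
    regs = {f"r{i}": 0 for i in range(8)}          # initial zeros are an assumption
    regs.update(sp=STACK_TOP, fp=STACK_TOP, ra=0)
    pc = labels["start"]
    for _ in range(max_steps):
        if pc not in memory:
            break
        op, *args = memory[pc]
        next_pc = pc + 1                            # default: fall through
        if op == "li":
            regs[args[0]] = args[1]
        elif op == "mov":
            regs[args[0]] = regs[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "sub":
            regs[args[0]] = regs[args[1]] - regs[args[2]]
        elif op == "jmp":
            next_pc = labels[args[0]]
        elif op == "bz" and regs[args[0]] == 0:
            next_pc = labels[args[1]]
        elif op == "bnz" and regs[args[0]] != 0:
            next_pc = labels[args[1]]
        pc = next_pc
    return regs

# Sums 3 + 2 + 1 into r0; the leading constant shows that data shares the address space.
PROGRAM = [
    ([],        ("constant", 55)),            # address 1000
    (["start"], ("li", "r0", 0)),             # address 1001
    ([],        ("li", "r1", 3)),
    ([],        ("li", "r2", 1)),
    (["loop"],  ("add", "r0", "r0", "r1")),   # address 1004
    ([],        ("sub", "r1", "r1", "r2")),
    ([],        ("bnz", "r1", "loop")),
]

print(run(PROGRAM)["r0"])  # 6
```

Because start: precedes the second entry, execution begins at address 1001 rather than 1000, mirroring the example in step 3.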


Debugging

Debugging assembly language programs is notoriously difficult! While writing your code generator, you will spend quite a bit of time running generated Cool Assembly programs through the Cool CPU Simulator to see if they work. Often they will not. The Cool CPU Simulator has been designed with a large number of features to aid debugging. Basically none of these features are present in traditional assemblers, so you actually have a wealth of debugging support, but it will still be difficult.


Performance Model

As discussed above, the Cool reference compiler also includes a reference machine simulator to interpret Cool Assembly Language instructions. This simulator can be invoked directly by passing a .cl-asm file to cool.exe:
cool$ cat hello-world.cl
class Main {
  main():Object {
    (new IO).out_string("hello, world!\n")
  };
};
cool$ ./cool --asm hello-world.cl
cool$ ./cool hello-world.cl-asm 
hello, world!
The simulator can also give detailed performance information:
 
cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =        107 @    1 =>        107
PROFILE:        pushes and pops =         29 @    1 =>         29
PROFILE:             cache hits =         22 @    0 =>          0
PROFILE:           cache misses =        570 @  100 =>      57000
PROFILE:     branch predictions =          0 @    0 =>          0
PROFILE:  branch mispredictions =         11 @   20 =>        220
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 59356
The execution time of a Cool Assembly Language program is measured in simulated instruction cycles. In general, each assembly instruction takes one cycle. Some instructions, such as system calls or memory operations, can cost many more cycles. The total cycle cost of a program is the sum of all of its component cycle costs.
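
As a quick check of that sum, the following Python sketch recomputes the CYCLES line from the PROFILE lines of the unoptimized hello-world run shown above. The per-event costs are read from the "@" column of that output; the dictionary and function names are assumptions made for this illustration.

```python
# Per-event cycle costs, taken from the "@" column of the profile output.
COSTS = {
    "instructions": 1,
    "pushes and pops": 1,
    "cache hits": 0,
    "cache misses": 100,
    "branch predictions": 0,
    "branch mispredictions": 20,
    "multiplications": 10,
    "divisions": 40,
    "system calls": 1000,
}

def total_cycles(counts):
    """Sum count * cost over every profiled event category."""
    return sum(COSTS[name] * count for name, count in counts.items())

# Event counts from the unoptimized hello-world profile.
HELLO_WORLD = {
    "instructions": 107,
    "pushes and pops": 29,
    "cache hits": 22,
    "cache misses": 570,
    "branch predictions": 0,
    "branch mispredictions": 11,
    "multiplications": 0,
    "divisions": 0,
    "system calls": 2,
}

print(total_cycles(HELLO_WORLD))  # 59356
```

Note that cache misses account for 57000 of the 59356 cycles, so memory behavior dominates the cost of this program.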

In modern architectures, memory hierarchy effects (e.g., caching) and branch prediction are dominant factors in the execution speed of a program. To give you a flavor for what real-world code optimization is like, the Cool Simulator also simulates a cache and a branch predictor.

The Cool Simulator features a 64-word least-recently-used fully associative combined instruction and data cache. It also uses a static backward = taken, forward = not taken branch prediction scheme.
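
A 64-word least-recently-used fully associative cache can be modeled roughly as follows. This is a sketch under stated assumptions rather than the simulator's code: the class and method names are invented for illustration, and the 100-cycle miss penalty is the figure shown in the profile output.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache of single words with least-recently-used eviction."""

    def __init__(self, capacity=64, miss_penalty=100):
        self.capacity = capacity
        self.miss_penalty = miss_penalty
        self.lines = OrderedDict()   # cached addresses, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Touch an address (read or write) and return the extra cycles it costs."""
        if address in self.lines:
            self.lines.move_to_end(address)   # now the most recently used
            self.hits += 1
            return 0
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict the least recently used address
        self.lines[address] = None
        return self.miss_penalty

cache = LRUCache()
cycles = 0
for _ in range(2):                      # touch the same 64 addresses twice
    for address in range(1000, 1064):
        cycles += cache.access(address)
cycles += cache.access(1064)            # 65th distinct address evicts 1000
cycles += cache.access(1000)            # so 1000 misses again

print(cache.hits, cache.misses, cycles)  # 64 66 6600
```

The usage shows why a working set of 64 words or fewer is cheap to revisit, while touching even one extra address starts evicting words that may be needed again.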

We now discuss each of the performance components in turn:

  1. instructions. Each Cool Assembly Language instruction executed costs at least one cycle. This represents the time taken to fetch and decode the instruction, as well as to shepherd it through the pipeline. Instructions such as li, mov and add take one cycle.
  2. pushes and pops. Because push and pop each involve both a load/store and an add/sub, each costs an additional cycle (for a total of two). (push and pop can also incur cache miss penalties; see below.)
  3. cache hits & misses. In modern computers, the CPU executes much faster than main memory: hundreds of "normal" instructions can be executed in the time it takes to fetch one value from memory. To mitigate this problem, a small number of values are placed in expensive, high-speed memory near the CPU. This small, fast memory stores recently-used values and is known as a cache. The Cool Simulator features a 64-word fully associative cache: the values associated with 64 addresses can be accessed rapidly. If a memory read or write accesses an address that is in the cache, the instruction completes immediately with no extra cost. If a memory read or write accesses an address that is not in the cache, it costs 100 cycles while that value is read in from main memory. If there is no room in the cache to hold that new address's value, the address that has been touched (read or written) least recently is evicted and the new address/value is put in its place. Cache misses are typically classified as compulsory, capacity, or conflict misses.

    Note that the cache and the cache miss penalty apply to every access to memory. This includes:

  4. branch prediction & misprediction. In a modern pipelined CPU, the next instruction is fetched before the current instruction has completed. This means that the CPU needs to know the address of the next instruction as early as possible. For a conditional branch, that may be difficult: the CPU may have to wait until the comparison is complete to determine whether the next instruction will be at pc+1 or at label. Modern CPUs optimistically "guess" or "predict" that a branch will go one way or the other and then roll back instructions if they are wrong. A correctly-predicted branch costs nothing; a mispredicted branch costs 20 cycles. The following instructions are related to this cost:
  5. multiplication & division. Integer multiplication and division take longer on most architectures than addition and subtraction. In the Cool Simulator, mul costs an extra 10 cycles and div costs an extra 40.
  6. system calls. A system call involves trapping to the operating system, switching CPU protection contexts, putting the old process on the scheduling queue, handling the operation, rescheduling the new process, and switching CPU protection contexts again. System calls take forever. In the Cool Simulator, each syscall instruction takes 1000 extra cycles.
This cost model involves realistic components but potentially unrealistic values (e.g., a modern CPU would have a much larger non-associative cache, and also a much larger cache miss cost). If you're interested in that sort of performance modeling, take a graduate class in computer architecture. You should know that this CPU performance model is one of the most realistic that I've seen for a compiler optimization project in terms of the issues that it forces you to address.
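
The static "backward = taken, forward = not taken" prediction scheme described in item 4 can be sketched as below. The function names are hypothetical, only conditional branches are modeled, and treating a branch to its own address as backward is an assumption of this sketch.

```python
MISPREDICT_PENALTY = 20   # cycles charged for a mispredicted branch

def predict_taken(pc, target):
    """Static prediction: backward branches are predicted taken, forward ones not taken."""
    return target <= pc   # assumption: a branch to its own address counts as backward

def branch_cost(pc, target, taken):
    """Return the extra cycles for one execution of a conditional branch."""
    return 0 if predict_taken(pc, target) == taken else MISPREDICT_PENALTY

# A loop-closing branch at address 1006 that jumps back to 1004:
# taken twice, then not taken when the loop exits.
loop_cost = sum(branch_cost(1006, 1004, taken) for taken in (True, True, False))

# A forward branch at address 1010 that skips ahead to 1020.
skip_taken = branch_cost(1010, 1020, True)
skip_not_taken = branch_cost(1010, 1020, False)

print(loop_cost, skip_taken, skip_not_taken)  # 20 20 0
```

Under this scheme a loop pays the penalty only once, on exit, while a forward branch is cheapest when it usually falls through. That suggests laying out code so the common case of a forward conditional branch is the not-taken path.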

The reference compiler includes a simple reference peephole optimizer, the --opt flag. You can use it to get an idea for how to get started (but note that we are evil and strip all comments from the optimized output).
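
To give a flavor of what a peephole pass looks like, here is a minimal Python sketch. It is not the reference optimizer, whose rules are not described here. The two rewrite rules, the function name, and the one-instruction-per-string input format are assumptions chosen for illustration. The push/pop rule also assumes that memory below the stack pointer is never read after the pop.

```python
import re

# Matches a register-to-register copy such as "mov r1 <- r2" with no label or comment.
MOV = re.compile(r"^mov\s+(\w+)\s*<-\s*(\w+)$")

def peephole(lines):
    """Remove self-moves and adjacent push/pop pairs on the same register."""
    out = []
    for line in lines:
        text = line.strip()
        match = MOV.match(text)
        if match and match.group(1) == match.group(2):
            continue                      # mov rX <- rX has no effect
        previous = out[-1].strip() if out else ""
        if (text.startswith("pop ") and previous.startswith("push ")
                and text.split()[1] == previous.split()[1]):
            out.pop()                     # push rX then pop rX leaves rX and sp unchanged
            continue
        out.append(line)
    return out

before = [
    "mov r1 <- r1",
    "push r0",
    "pop r0",
    "add r0 <- r1 r2",
    "push r1",
    "pop r2",
]
after = peephole(before)
print(after)  # ['add r0 <- r1 r2', 'push r1', 'pop r2']
```

A line that carries a label (for example a hypothetical "done: pop r0") does not start with "pop ", so it is left alone. This keeps the sketch conservative around branch targets. In the cost model above, each removed push or pop saves two cycles plus any cache miss it would have caused.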

 
cool$ ./cool --opt --asm hello-world.cl
cool$ ./cool --profile hello-world.cl-asm 
hello, world!
PROFILE:           instructions =         79 @    1 =>         79
PROFILE:        pushes and pops =         23 @    1 =>         23
PROFILE:             cache hits =         15 @    0 =>          0
PROFILE:           cache misses =        513 @  100 =>      51300
PROFILE:     branch predictions =          2 @    0 =>          0
PROFILE:  branch mispredictions =          7 @   20 =>        140
PROFILE:        multiplications =          0 @   10 =>          0
PROFILE:              divisions =          0 @   40 =>          0
PROFILE:           system calls =          2 @ 1000 =>       2000
CYCLES: 53542
For the hello-world program, the optimizer reduces the cycle cost from 59356 to 53542, an improvement of roughly 10%. The reference optimizer is only 80 lines of code. If you are writing an optimizer, you will want to do better than the reference on basically all input programs. Notably, you'll probably want to implement the dataflow analysis and optimization framework described in class.