CA3 — Machine Dependent Assembly Generation

Project Overview

CA3 is due 3/3 at 11PM.

For this assignment, you will write a program that generates machine-dependent assembly instructions for certain simple Cool programs.

You may do this assignment in OCaml, Python, JavaScript, Haskell or Ruby.

You may work in a team of two people for this assignment. You may work in a team for any or all subsequent programming assignments. You do not need to keep the same teammate. The course staff are not responsible for finding you a willing teammate.

Goal

For this assignment you will write a correct code generator for a subset of Cool. Among other things, this involves implementing the operational semantics specification of Cool. You do not have to track information to generate run-time errors (e.g., division by zero, dispatch on void). You do not have to worry about "malformed input" because the semantic analyzer (from PA4) will rule out bad programs.

You will also write additional code to unserialize the class map, implementation map, parent map, and annotated AST produced by the semantic analyzer.

The Specification

You must create three artifacts:

A program that takes a single command-line argument (e.g., file.cl-type). That argument will be an ASCII text Cool class map, implementation map, parent map and annotated AST file (as described in PA4). Your program must emit either Cool Assembly Language (file.cl-asm) or x86-64 Assembly Language (file.s). Your program will consist of a number of files in a single programming language.

If you are targeting Cool Assembly Language, then executing file.cl-asm in a Cool CPU Simulator must produce the correct output for file.cl according to Cool's operational semantics.
If you are targeting x86-64, then compiling file.s with gcc on a 64-bit x86 Linux system must produce an executable that, when run, produces the correct output for file.cl according to Cool's operational semantics.
You will only be given .cl-type files from programs that pass the semantic analysis phase of the reference compiler. You are not responsible for correctly handling (1+"hello") programs.

A plain ASCII text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A few paragraphs should suffice.
Testcases test1.cl and test2.cl. The testcases should exercise compiler corner cases.

Assembly Language

You have your choice of either Cool Assembly Language or x86-64 Assembly Language for this project. You may pick either one, or even switch between assignments if you don't like your initial choice (but the deadlines remain unchanged).

Cool Assembly Language is a simplified RISC-style assembly language that is similar to Java Bytecode.

Pro: It is dynamically "well-typed", so if you produce assembly code that tries to add a string to a code label, the system will notice immediately (rather than producing garbage and unhelpfully crashing later).
Pro: It is portable and works anywhere the Reference Compiler works.
Pro: It abstracts away notions like system calls and string handling so that you can focus on code generation and layout.
Pro: It has just enough instructions to directly compile Cool and no more — learning it is simple.
Pro: It includes built-in debugging and tracing suppot.
Con: Because it is a simualted bytecode, you cannot use GDB or Visual Studio to debug your output.
Con: Because it is a simulated bytecode, you cannot actually "run the raw .exe" on your computer.

x86-64 Assembly Language is the real deal — it's what most desktop machines use.

Pro: You get a strong feeling of accomplishment for working close to "the bare metal" and using a real-world assembly language. Employers know that writing an optimizing compiler that targets x86 is not easy!
Pro: You can use GDB or Visual Studio to step through and debug your resulting executable.
Pro: For the subsequent optimization assignment, you can use any instruction, power or trick that is legal on x86-64 to squeeze every last drop of performance out of your code.
Pro: Students who successfully complete the x86-64 optimizing compiler are in an excellent position to be asked to join a research group and receive glowing letters of recommendation.
Con: There are many dialect and syntax options of "x86 assembly language". This course only allows the 64-bit version supported by gcc on linux. If you can't run that on your development machine, you must use Cool Assembly.
Con: You must handle all of the details, including laying out strings in memory and emitting assembly code that interfaces with standard library functions like malloc or scanf or exit to implement Cool's operational semantics.
Con: You'll have to use GDB (or something similar) because there's no innate debugging support.
Con: x86 assembly is infamous for being hideous. It is a serious commitment.

To find out if your development machine supports our flavor of x86-64 assembly language, try this:

./cool --x86 hello-world.cl
gcc hello-world.s
./a.out

If there were no assembler or linker errors and you saw "Hello, world.", your system can handle our x86-64 assembly. If not, you must use Cool Assembly Language.

Although we're invoking gcc, that's just shorthand: it's calling as and collect2 and whatnot to assemble the .s file into an object file and link that object file into an executable. You can see the individual steps with gcc -v.

No Error Reporting

For this assignment you do not have to generate assembly code that checks for or reports run-time errors. (You'll do that for the next assignment.) However, you can save yourself from subsequent regret by remembering that you will eventually have to do so (e.g., don't throw away that line-number information!).

Restricted Subset of Cool

For this assignment, you only have to handle simple Cool programs that use a subset of the possible expressions. The input program will introduce only one class, Main, with only one method, main. The following AST or language elements will not appear:

attribute_*
self_dispatch — except IO (e.g., in_out, out_int)
static_dispatch
dynamic_dispatch
internal functions
while
isvoid
case
new
let_binding_init
any novel class/method — except "Main.main"
run-time errors (e.g., division by zero)
strings — programs will largely involve integers and booleans

This results in single-class, single-method Cool programs focusing on integer and boolean manipulation.

Commentary

I strongly recommend that you convert to a control-flow graph where the basic blocks are three-address code (e.g., as in previous Compilers Assignments) and convert that three-address code into assembly.

You will have to handle certain internal functions (e.g., IO.out_string) that are defined in PA4.

You can do basic testing as follows:

$ cool --asm file.cl-type
$ cool file.cl-asm >& reference-output
$ my-code-generator file.cl-type 
$ cool file.cl-asm >& my-output
$ diff my-output reference-output

Whitespace and newlines do not matter in your file.{cl-asm,s} assembly code. However, whitespace and newlines do matter for your simulated Cool CPU output. This is because you are specifically being asked to implement IO (and, later, substring) functions.

You should implement all of the relevant operational semantics rules in the Reference Manual. You will also have to implement all of the relevant built-in functions on the IO class.

What To Turn In For CA3

You must turn in a zip file containing these files:

readme.txt -- your README file
test1.cl -- a testcase
test2.cl -- a testcase
source_files -- your implementation, including

main.rb or
main.py or
main.js or
main.hs or
main.ml

Your zip file may also contain:

team.txt -- an optional file listing your other team member (see below -- if you are not working in a team, do not include this file)

Submit the file as you did for CA1.

Working In Pairs

You may complete this project in a team of two. Teamwork imposes burdens of communication and coordination, but has the benefits of more thoughtful designs and cleaner programs. Team programming is also the norm in the professional world.

Students on a team are expected to participate equally in the effort and to be thoroughly familiar with all aspects of the joint work. Both members bear full responsibility for the completion of assignments. Partners turn in one solution for each programming assignment; each member receives the same grade for the assignment. If a partnership is not going well, the teaching assistants will help to negotiate new partnerships. Teams may not be dissolved in the middle of an assignment.

If you are working in a team, exactly one team member should submit a CA3 zipfile. That submission should include the file team.txt, a one-line flat ASCII text file that contains exactly and only the email address of your teammate. Don't include the @virginia.edu bit. Example: If ph4u and wrw6y are working together, ph4u would submit ph4u-ca3.zip with a team.txt file that contains the word wrw6y. Then ph4u and wrw6y will both receive the same grade for that submission.

Autograding

We will use scripts to run your program on various testcases. The testcases will come from the test.cl files you and your classsmates submit as well as held-out testcases used only for grading. Your programs cannot use any special libraries (aside from the OCaml unix and str libraries, which are not necessary for this assignment). We will use (loosely) the following commands to execute them:

ghc --make -o a.out *.hs ; ./a.out testcase.cl-type >& testcase.out node main.js testcase.cl-type >& testcase.out ocamlc unix.cma str.cma *.ml ; ./a.out testcase.cl-type >& testcase.out python main.py testcase.cl-type >& testcase.out ruby main.rb testcase.cl-type >& testcase.out

You may thus have as many source files as you like (although two or three should suffice) -- they will be passed to your language compiler in alphabetical order (if it matters).

In each case we will then compare your output to the correct answer:

diff testcase.out correct-answer.out

Note that this time we do not ignore newlines and whitespace since we are explicitly testing your implementation of a string IO subsystem. You must get every character correct in non-error instances. If your answer is not the same as the reference answer you get 0 points for that testcase. Otherwise you get 1 point for that testcase.

Grading Rubric

Grading (out of 40)

30 points for autograder tests

+.2 point for each passed test case (300 total)
0 point for each failed test case (i.e., occurrences of 'FAILED')

8 points for valid test1.cl and test2.cl files

+4 points each — high quality tests
+2 points each — valid tests that do not stress code generation
+0 points — test cases showing no effort

4 points for README.txt

+4 points — clear, thorough description of design
+2 points — vague, unclear description omitting details
+0 points — little effort or if you submit an RTF/DOC/PDF instead of a plain TXT faile

If you are in a group: -10 points if you do not submit a correct team.txt file