Compilers Practicum - Compilers Assignment 1

Compilers Assignment 1 — Dead Code Elimination

Project Overview

Compilers Assignments 1 through 5 will direct you to design and build an optimizing compiler for Cool. Each assignment will cover an intermediate step along the path to a complete optimizing compiler.

You may do this assignment in OCaml, Haskell, JavaScript, Python or Ruby. (There are no language restrictions for Compilers Assignments.) (If you don't know what to do, OCaml and Haskell are likely the best languages for the compiler and optimizer. Python is a reasonable third choice.)

You may work in a team of two people for this assignment. You may work in a team for any or all subsequent Compilers Assignments. You do not need to keep the same teammate. The course staff are not responsible for finding you a willing teammate.

Goal

For this assignment you will write a dead code eliminator for three-address code based on live variable dataflow analysis of single methods. Your program will serialize and deserialize files in Cool Three-Address Code format.

The Specification

You must create three artifacts:

A program that takes a single command-line argument (e.g., file.cl-tac). That argument will be an ASCII text Cool Three Address Code file (as described in below) corresponding to a single method. The cl-tac file will always be well-formed (i.e., there will be no errors in the cl-tac file).
Your program must a revised, valid .cl-tac file to standard output. The output should be the same as the input but with dead code removed. Your program will consist of a number of source files, all in the same programming language.
A plain ASCII text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A paragraph or two should suffice.
Testcases test1.cl-tac and test2.cl-tac. These tests should exercise corner cases of live variable analysis and dead code elimination.

Three Address Code Example

Three-address code is a simplified representation focusing on assignments to variables and unary or binary operators.

The code below computes (1+3)*7 prints out the result, 28:

a <- int 1
b <- int 3
c <- + a b 
d <- int 7
e <- * c d 
retval <- call out_int e 
return retval

In three address code the expression (1+3)*7 must be broken down so that the addition and multiplication operations are each handled separately. (The exact format is described in the next section.)

As a second example, the code below reads in an integer and computes its absolute value (by checking to see if it is negative and subtracting it from zero if so):

x <- call in_int
z <- int 0
b <- < x z 
bt b is_negative
output <- x
jmp do_the_printing
label is_negative
output <- - z x 
label do_the_printing
retval <- call out_int output 
return retval

Note the use of labels, conditional branches and unconditional jumps. You can use the cool reference compiler to evaluate .cl-tac files:

$ echo -88 | ./cool.exe absval.cl-tac
88

The `.cl-tac` File Format

Your program should serialize and deserialize file.cl-tac files representing three-address code.

The format of .cl-tac files is designed to be easy to serialize and deserialize — you can process it in just a few lines, without needing special parsing tools.

A .cl-tac file is text-based and line-based. Your program must support Unix, Mac and Windows text files — regardless of where you, personally, developed your program. Note that lines can be delineated by carriage returns, new lines, or both (e.g., the regular expression [\r\n]+ may be helpful).

Possible line contents are:

Arithmetic (e.g., "x = y + z"):

x <- + y z
x <- - y z
x <- * y z
x <- / y z
Comparisons (e.g., "x = y < z"):
x <- < y z
x <- <= y z
x <- = y z
Constants:
x <- int integer
x <- bool boolean
x <- string\nstring-on-next-line\n
Note that the string constant will appear on the next line. The string constant assignment is the only three-address form that takes two lines.
Boolean Negation:
x <- not y
Arithmetic Negation:
x <- ~ y
Object Allocation:
x <- new type
Relevant types are Int, String, Bool and Object. See the Cool Reference Manual.
Object Default Value:
x <- default type
Null Check:
x <- isvoid y
Function Calls:
x <- call out_int y
x <- call out_string y
x <- call in_int
x <- call in_string
Branches, Labels and Control Flow:
jmp label
label label
return x
comment ...text until end of line...
bt x label
The bt instruction branches if the variable x is True (x must be a Bool); otherwise control falls through to the next instruction.

You can use cool --tac file.cl to produce file.cl-tac for the first method in that Cool file. Thus you can use the Cool reference compiler to produce three address code for you automatically, given Cool source code. You should do this to test your CA1 code.

Three-Address Code Deserialization

You should be able to use high-level code in the spirit of this example:

while not end of file
        read one line
        if line = "" or line[0] = "#" or line[0] = ";" then 
                skip    // comment
        words = split line by whitespace
        if words[0] = "label"
                answer += TAC_Label(words[1]) 
        else if words[0] = "jmp"
                answer += TAC_Jmp(words[1])
        else if ... // "bt", "return", "comment", etc. 
        // otherwise: x <- OP ... 
        else if words[2] = "+"
                answer += TAC_Plus(words[0], words[2], words[3])
        else if words[2] = "string"
                next_line = read one line
                answer += TAC_String(words[0], next_line) 
        else if ... // all other cases

Dead Code Elimination

A variable is live if it may be needed in the future. That is, an assignment to variable v at point p is live if there exists a path from p to the function exit along which v is read before it is overwritten.

For example:

a <- int 1
b <- int 3
c <- + a b 
d <- int 7
retval <- call out_int c 
return retval

On the fourth line, the assignment to d is dead because d is not subsequently used. Dead assignments can be removed. Often eliminating one piece of dead code can reveal additional dead code:

a <- int 1
b <- int 3
c <- + a b 
d <- int 7
e <- + d d 
retval <- call out_int c 
return retval

In above example, the assignment to d is not dead (yet!) but the assignment to e is dead. Once the assignment to e is eliminated, the assignment to d becomes dead and can then be eliminated. Thus, live variable analysis and dead code elimination must be repeated until nothing further changes.

You will use the live variable analysis form of data-flow analysis to determine which assignments are dead code.

Function calls with I/O side effects should never be eliminated.

Basic Blocks and Control-Flow Graphs

A basic block is a sequence of instructions with only one control-flow entry and one control-flow exit. That is, once you start executing the instructions in a basic block, you must execute all of them — you cannot jump out early or jump in half-way through.

A local dataflow analysis operates on basic blocks.

As part of this assignment you will also implement a global dataflow analysis that operates on multiple basic blocks (that together form one entire method). A control-flow graph is a graph in which the nodes are basic blocks and the edges represent potential control flow.

Commentary

Your assignment will have the following stages:

Deserialize the input .cl-tac file
Identify basic blocks and construct the control-flow graph
Repeat until nothing changes:
1. Perform global live variable analysis on the control-flow graph
  - This involves local live variable analysis on individual statements or basic blocks
2. Eliminate dead code
Serialize the result to standard output in .cl-tac format

You can do basic testing as follows:

$ cool --tac file.cl --out original
$ cool --tac --opt file.cl --out ref_optimized
$ my-dce original.cl-tac > my_optimized.cl-tac
$ cool ref_optimized.cl-tac > optimized.output
$ cool my_optimized.cl-tac > my.output
$ diff optimized.output my.output
$ cool --profile ref_optimized.cl-tac | grep STEPS
$ cool --profile my_optimized.cl-tac | grep STEPS

Passing --opt and --tac to the reference compiler will cause the reference compiler to perform dead code elimination before emitting the .cl-tac file.

What To Turn In For CA1

You must turn in a zip file containing these files:

readme.txt -- your README file
test1.cl-tac -- a tricky testcase
test2.cl-tac -- a tricky testcase
source_files -- including
- main.rb or
- main.py or
- main.js or
- main.hs or
- main.ml

Your zip file may also contain:

team.txt -- an optional file listing your other team member (see below -- if you are not working in a team, do not include this file)

Submit the file to the course autograding website.

Working In Pairs

You may complete this project in a team of two. Teamwork imposes burdens of communication and coordination, but has the benefits of more thoughtful designs and cleaner programs. Team programming is also the norm in the professional world.

Students on a team are expected to participate equally in the effort and to be thoroughly familiar with all aspects of the joint work. Both members bear full responsibility for the completion of assignments. Partners turn in one solution for each programming assignment; each member receives the same grade for the assignment. If a partnership is not going well, the teaching assistants will help to negotiate new partnerships. Teams may not be dissolved in the middle of an assignment. If your partner drops the class at the last minute you are still responsible for the entire assignment.

If you are working in a team, exactly one team member should submit a CA1 zipfile. That submission should include the file team.txt, a one-line flat ASCII text file that contains the email address of your teammate. Don't include the @virgnia.edu bit. Example: If ph4u and wrw6y are working together, ph4u would submit ph4u-ca1.zip with a team.txt file that contains the word wrw6y. Then ph4u and wrw6y will both receive the same grade for that submission.

Autograding

We will use scripts to run your program on various testcases. The testcases will come from the test1.cl-tac and test2.cl-tac files you and your classsmates submit as well as held-out testcases used only for grading. Your programs cannot use any special libraries (aside from the OCaml unix and str libraries, which are not necessary for this assignment). We will use (loosely) the following commands to execute them: ghc --make -o a.out *.hs ; ./a.out testcase.cl-tac >& testcase.out node main.js testcase.cl-tac >& testcase.out ocamlc unix.cma str.cma *.ml ; ./a.out testcase.cl-tac >& testcase.out python main.py testcase.cl-tac >& testcase.out ruby main.rb testcase.cl-tac >& testcase.out You may thus have as many source files as you like (although two or three plus your parser definition should suffice) -- they will be passed to your language compiler in alphabetical order (if it matters). Note that we will not run the parser generator for you -- you should run it and produce the appropriate ML, Python or Ruby file and submit that.

In each case we will then compare your output to the correct answer, and then compare the size of your output to a size threshhold. If your dead code eliminator changes the meaning of the program, you will get 0 points for that testcase. Otherwise, you get 1 point for that test case if you eliminate "enough" dead code to meet the threshhold.

We will perform the autograding on some unspecified test system. It is likely to be Solaris/UltraSPARC, Cygwin/x86 or Linux/x86. However, your submissions must officially be platform-independent (not that hard with a scripting language). You cannot depend on running on any particular platform.

There is more to your grade than autograder results. See the Compilers Assignment page for a point breakdown.

Your submission may not create any temporary files. Your submission may not read or write any files beyond its input and output. We may test your submission in a special "jail" or "sandbox".