Programming Assignment 2 - The Lexer

Project Overview

Programming assignments 2 through 5 will direct you to design and build a compiler for Cool. Each assignment will cover one component of the compiler: lexical analysis, parsing, semantic analysis, and operational semantics. Each assignment will ultimately result in a working compiler phase which can interface with the other phases.

You may do this assignment in OCaml, Python or Ruby. You must use each language at least once (over the course of PA2 - PA5); you will use one language (presumably your favorite) twice.

You may work in a team of two people for this assignment. You may work in a team for any or all subsequent programming assignments. You do not need to keep the same teammate. The course staff are not responsible for finding you a willing teammate. However, you must still satisfy the language breadth requirement (i.e., you must be graded on at least one OCaml program, at least one Ruby program, and at least one Python program).

Goal

For this assignment you will write a lexical analyzer, also called a scanner, using a lexical analyzer generator. You will describe the set of tokens for Cool in an appropriate input format and the analyzer generator will generate actual code (in OCaml, Python or Ruby). You will then write additional code to serialize the tokens for use by later compiler stages.

The Specification

You must create three artifacts:
  1. A program that takes a single command-line argument (e.g., file.cl). That argument will be an ASCII text Cool source file. Your program must either indicate that there is an error in the input (e.g., a malformed string) or emit file.cl-lex, a serialized list of Cool tokens. Your program's main lexer component must be constructed by a lexical analyzer generator; the "glue code" for processing command-line arguments and serializing tokens should be written by hand (see the sketch after this list). If your program is called lexer, invoking lexer file.cl should yield the same output as cool --lex file.cl. Your program will consist of a number of OCaml files, a number of Python files, or a number of Ruby files.
  2. A plain ASCII text file called readme.txt describing your design decisions and choice of test cases. See the grading rubric. A few paragraphs should suffice.
  3. Testcases good.cl and bad.cl. The first should lex correctly and yield a sequence of tokens. The second should contain an error.
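
To give a concrete (if simplified) picture of the hand-written portion, here is a rough OCaml sketch of such a driver. The module name Lexer, its token function, and the (line, name, optional lexeme) result shape are assumptions for illustration only; the interface your generator actually produces will differ.

  (* main.ml -- hypothetical hand-written driver ("glue code").
     Lexer.token stands in for the function produced by the lexical
     analyzer generator; it is assumed to return Some (line, name,
     optional lexeme) for each token and None at end of file. *)
  let () =
    if Array.length Sys.argv < 2 then begin
      prerr_endline "usage: lexer file.cl" ;
      exit 1
    end ;
    let filename = Sys.argv.(1) in
    let chan = open_in filename in
    let lexbuf = Lexing.from_channel chan in
    (* Gather every token before writing anything, so that a lexing
       error (which terminates the program) never leaves a partial
       .cl-lex file behind. *)
    let rec gather acc =
      match Lexer.token lexbuf with
      | None -> List.rev acc
      | Some tok -> gather (tok :: acc)
    in
    let tokens = gather [] in
    close_in chan ;
    let out = open_out (filename ^ "-lex") in
    List.iter
      (fun (line, name, lexeme_opt) ->
        Printf.fprintf out "%d\n%s\n" line name ;
        match lexeme_opt with
        | Some lexeme -> Printf.fprintf out "%s\n" lexeme
        | None -> ())
      tokens ;
    close_out out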

Line Numbers

The first line in a file is line 1. Each successive '\n' newline character increments the line count. Your lexer is responsible for keeping track of the current line number.
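
In ocamllex, for example, one simple approach is to keep a counter and bump it whenever a newline is matched. A minimal sketch (the counter name line_number is just an assumption):

  (* count_lines.mll -- minimal ocamllex sketch that only counts '\n'
     characters; a real Cool lexer would do this inside its token rules. *)
  {
    let line_number = ref 1
  }
  rule count = parse
    | '\n' { incr line_number ; count lexbuf }
    | _    { count lexbuf }
    | eof  { !line_number }

OCaml's Lexing module also offers new_line for updating the position stored in the lexbuf itself, if you would rather track positions that way.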

Error Reporting

To report an error, write the string ERROR: line_number: Lexer: message to standard output and terminate the program. You may write whatever you want in the message, but it should be fairly indicative. Example erroneous input:
Backslash not allowed \

Example error report output:

ERROR: 1: Lexer: invalid character: \
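
In OCaml, for instance, a helper along the following lines would produce that format; the function name and the idea of passing in the current line number are assumptions, not requirements.

  (* Hypothetical helper: print the required ERROR line and terminate.
     Example use: lexer_error !line_number "invalid character: \\" *)
  let lexer_error (line_number : int) (message : string) =
    Printf.printf "ERROR: %d: Lexer: %s\n" line_number message ;
    exit 1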

The .cl-lex File Format

If there are no errors in file.cl your program should create file.cl-lex and serialize the tokens to it. Each token is represented by a pair (or triplet) of lines. The first line holds the line number. The second line gives the name of the token. The optional third line holds additional information (i.e., the lexeme) for identifiers, integers, strings and types.

Example input:

Backslash not 
        allowed

Example .cl-lex output:

1
type
Backslash
1
not
2
identifier
allowed
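
The serializer itself can be quite small. The sketch below assumes each token is represented as a (line, name, optional lexeme) triple; that representation is an illustrative assumption, not part of the specification.

  (* Write one token as a pair or triplet of lines, then a whole list. *)
  let serialize_token out (line, name, lexeme_opt) =
    Printf.fprintf out "%d\n%s\n" line name ;
    (match lexeme_opt with
     | Some lexeme -> Printf.fprintf out "%s\n" lexeme
     | None -> ())

  let serialize_tokens filename tokens =
    let out = open_out (filename ^ "-lex") in
    List.iter (serialize_token out) tokens ;
    close_out out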

The official list of token names is:

In general the intended token is evident. For the more exotic names: at = @, larrow = <-, lbrace = {, le = <=, lparen = (, lt = <, rarrow = =>, rbrace = }, semi = ;, tilde = ~.
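
If you find it convenient to map lexemes to these names in your glue code, a table like the following OCaml sketch captures the correspondences above (the function name is just an assumption):

  (* Hypothetical lookup from punctuation lexeme to .cl-lex token name. *)
  let name_of_symbol = function
    | "@"  -> "at"     | "<-" -> "larrow" | "{" -> "lbrace"
    | "<=" -> "le"     | "("  -> "lparen" | "<" -> "lt"
    | "=>" -> "rarrow" | "}"  -> "rbrace" | ";" -> "semi"
    | "~"  -> "tilde"
    | other -> other  (* keywords and other names pass through unchanged *)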

The .cl-lex file format is exactly the same as the one generated by the reference compiler when you specify --lex. In addition, the reference compiler (and your upcoming PA3 parser!) will read .cl-lex files instead of .cl files.

Lexical Analyzer Generators

The OCaml lexical analyzer generator is called ocamllex and it comes with any OCaml distribution.
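
To give a flavor of the input format, a heavily stripped-down ocamllex definition might start out like this. The token type and the handful of rules shown are illustrative only and cover just a tiny fraction of Cool:

  (* lexer.mll -- illustrative fragment, not a complete Cool lexer *)
  {
    type token =
      | Semi | LParen | RParen | Plus
      | Integer of string
      | Identifier of string
  }
  rule token = parse
    | [' ' '\t' '\r']+ { token lexbuf }   (* skip whitespace *)
    | '\n'   { (* bump your line counter here *) token lexbuf }
    | ';'    { Some Semi }
    | '('    { Some LParen }
    | ')'    { Some RParen }
    | '+'    { Some Plus }
    | ['0'-'9']+ as lxm { Some (Integer lxm) }
    | ['a'-'z'] ['a'-'z' 'A'-'Z' '0'-'9' '_']* as lxm { Some (Identifier lxm) }
    | eof    { None }

Running ocamllex lexer.mll then produces lexer.ml, which your hand-written driver can be compiled against.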

A Ruby lexical analyzer generator called ruby-lex is available, but you must download it yourself.

A Python lexical analyzer generator called ply is available, but you must download it yourself.

All of these lexical analyzer generators are derived from lex (or flex), the original lexical analyzer generator for C. Thus you may find it handy to refer to the Lex paper or the Flex manual. When you're reading, mentally translate the C code references into the language of your choice.

My personal opinion is that the OCaml and Python tools are a bit more mature (i.e., easier to use) than the Ruby tools for this particular project, but feel free to prove me wrong. In addition, this is the programming project that will involve the least amount of "native coding", so if you have a least favorite language of the three you might consider using it for this project.

Commentary

You can do basic testing by running your lexer on a Cool source file and comparing the resulting .cl-lex file against the reference compiler's output.
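
For example, assuming your driver builds into an executable named lexer (an assumed name here), one plausible check is to lex a file yourself, lex it with the reference compiler, and diff the two results:

  ./lexer good.cl
  mv good.cl-lex good.cl-lex.mine
  cool --lex good.cl
  diff good.cl-lex.mine good.cl-lex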

You may find the reference compiler's --unlex option useful for debugging your .cl-lex files.

Need more testcases? Any Cool file you have (including the one you wrote for PA1) works fine. The ten in the cool-examples.zip file should be a good start. There's also one among the PA1 hints. You'll want to make more complicated test cases -- in particular, you'll want to make negative testcases (e.g., testcases with malformed string constants).

What To Turn In For PA2

You must turn in a zip file containing these files:
  1. readme.txt -- your README file
  2. good.cl -- a positive testcase
  3. bad.cl -- a negative testcase
  4. source_files -- your OCaml, Python or Ruby source files, including the file produced by your lexical analyzer generator
Your zip file may also contain team.txt (see Working In Pairs below). Submit the file as you did for PA1.

Working In Pairs

You may complete this project in a team of two. Teamwork imposes burdens of communication and coordination, but has the benefits of more thoughtful designs and cleaner programs. Team programming is also the norm in the professional world.

Students on a team are expected to participate equally in the effort and to be thoroughly familiar with all aspects of the joint work. Both members bear full responsibility for the completion of assignments. Partners turn in one solution for each programming assignment; each member receives the same grade for the assignment. If a partnership is not going well, the teaching assistants will help to negotiate new partnerships. Teams may not be dissolved in the middle of an assignment.

If you are working in a team, exactly one team member should submit a PA2 zipfile. That submission should include the file team.txt, a one-line, one-word flat ASCII text file that contains the email address of your teammate. Don't include the @virginia.edu bit. Example: If ph4u and wrw6y are working together, ph4u would submit ph4u-pa2.zip with a team.txt file that contains the word wrw6y. Then ph4u and wrw6y will both receive the same grade for that submission.

This seems picayune, but in the past we've had students fail to correctly format this one-word file. Thus you now get a point on this assignment for either formatting this file correctly (i.e., including only a single word that is equal to your partner's UVA email ID) or not including it (and thus not working in a pair).

Autograding

We will use scripts to run your program on various testcases. The testcases will come from the good.cl and bad.cl files you and your classmates submit as well as from held-out testcases used only for grading. Your programs cannot use any special libraries (aside from the OCaml unix and str libraries, which are not necessary for this assignment). We will execute your program by (loosely) invoking your language's compiler or interpreter on your source files. You may thus have as many source files as you like (although two or three plus your lexer definition should suffice) -- they will be passed to your language compiler in alphabetical order (if it matters). Note that we will not run the lexical analyzer generator for you -- you should run it and produce the appropriate ML, Python or Ruby file and submit that.

In each case we will then compare your output to the correct answer:

If your answer is not the same as the reference answer you get 0 points for that testcase. Otherwise you get 1 point for that testcase.

For error messages and negative testcases we will compare your output but not the particular error message. Basically, your lexer need only correctly identify that there is an error on line X. You do not have to faithfully duplicate our English error messages. Many people choose to (because it makes testing easier) -- but it's not required.

We will perform the autograding on some unspecified test system. It is likely to be Solaris/UltraSPARC, Cygwin/x86 or Linux/x86. However, your submissions must officially be platform-independent (not that hard with a scripting language). You cannot depend on running on any particular platform.

There is more to your grade than autograder results. See the Programming Assignment page for a point breakdown.

Your submission may not create any temporary files. Your submission may not read or write any files beyond its input and output. We may test your submission in a special "jail" or "sandbox".