Programming Assignment 2 - The Lexer
Project Overview
Programming assignments 2 through 5 will direct you to design and build
an interpreter for Cool. Each assignment will cover one component of the
interpreter: lexical analysis, parsing, semantic analysis, and
operational semantics. Each assignment will ultimately result in a working
compiler phase which can interface with the other phases.
You may do this assignment in OCaml, Haskell, JavaScript, Python or Ruby.
You must use at least four different languages over the course of PA2 -
PA5.
You may work in a team of two people for this assignment. You may work in a
team for any or all subsequent programming assignments. You do not need to
keep the same teammate. The course staff are not responsible for finding
you a willing teammate. However, you must still satisfy the language
breadth requirement (i.e., you must be graded on a different language
for each of PA2 - PA5).
Goal
For this assignment you will write a lexical analyzer, also called a
scanner, using a lexical analyzer generator. You will
describe the set of tokens for Cool in an appropriate input format and the
analyzer generator will generate actual code (in OCaml, Haskell, Python,
JavaScript or Ruby). You will then write additional code to serialize the tokens for
use by later interpreter stages.
The Specification
You must create three artifacts:
- A program that takes a single command-line argument (e.g.,
file.cl). That argument will be an ASCII text Cool source file.
Your program must either indicate that there is an error in the input
(e.g., a malformed string) or emit file.cl-lex, a serialized list
of Cool tokens. Your program's main lexer component must be constructed by
a lexical analyzer generator. The "glue code" for processing command-line
arguments and serializing tokens should be written by hand. If your
program is called lexer, invoking lexer file.cl should
yield the same output as cool --lex file.cl. Your program will
consist of a number of OCaml, Haskell, Python, JavaScript, or Ruby files.
- A plain ASCII text file called readme.txt describing your
design decisions and choice of test cases. See the grading rubric. A few
paragraphs should suffice.
- Testcases good.cl and bad.cl. The first should
lex correctly and yield a sequence of tokens. The second should contain an
error.
- You must use ply, ruby-lex, ocamllex, alex, or jison (or a similar
tool). Do not write your entire lexer by hand; parts of it must be
tool-generated. A minimal sketch of the ply approach appears below.
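For example, a minimal ply skeleton in Python might look like the
following. This is a sketch only: the token set is abbreviated, and
Cool's case-insensitive keywords, comments, and string constants are
not handled here.

    import sys
    import ply.lex as lex

    # Abbreviated token set; the official names are listed below.
    tokens = ('CLASS', 'IDENTIFIER', 'INTEGER', 'SEMI')

    t_SEMI = r';'

    # Function rules are tried in definition order, so the keyword
    # rule must come before the identifier rule.
    def t_CLASS(t):
        r'class\b'
        return t

    def t_IDENTIFIER(t):
        r'[a-z][A-Za-z0-9_]*'
        return t

    def t_INTEGER(t):
        r'[0-9]+'
        return t

    t_ignore = ' \t\r'   # simple whitespace; newlines are handled separately

    # (The t_newline and t_error rules are shown in the next two sections.)

    lexer = lex.lex()    # the tool-generated component

    # Hand-written "glue": read the file named on the command line.
    lexer.input(open(sys.argv[1]).read())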
Line Numbers
The first line in a file is line 1. Each successive '\n' newline
character increments the line count. Your lexer is responsible for keeping
track of the current line number.
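In ply, for instance, line tracking is conventionally done with an
explicit newline rule (a sketch in the style of the skeleton above):

    def t_newline(t):
        r'\n'
        t.lexer.lineno += 1   # ply does not count lines for you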
Error Reporting
To report an error, write the string ERROR: line_number: Lexer:
message to standard output and terminate the program. You may
write whatever you want in the message, but it should be fairly indicative.
Example erroneous input:
Backslash not allowed \
Example error report output:
ERROR: 1: Lexer: invalid character: \
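With ply, for instance, a t_error handler can produce this format and
terminate (a sketch continuing the skeleton above, which already
imports sys; the message text is up to you):

    def t_error(t):
        print('ERROR: %d: Lexer: invalid character: %s'
              % (t.lexer.lineno, t.value[0]))
        sys.exit(1)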
The .cl-lex File Format
If there are no errors in file.cl your program should create
file.cl-lex and serialize the tokens to it. Each token is
represented by a pair (or triplet) of lines. The first line holds the line
number. The second line gives the name of the token. The optional third
line holds additional information (i.e., the lexeme) for
identifiers, integers, strings and types. For example, for an integer
token the third line should contain the decimal integer value.
Example input:
Backslash not
allowed
Example .cl-lex output:
1
type
Backslash
1
not
2
identifier
allowed
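Continuing the Python/ply sketch from above, the hand-written
serializer might look like this. HAS_LEXEME and serialize are
illustrative names, not part of any API; the sketch assumes uppercase
ply token names that are lowered to the official names.

    # Tokens that carry a third "lexeme" line in the .cl-lex format.
    HAS_LEXEME = {'identifier', 'integer', 'string', 'type'}

    def serialize(source_name, lexer):
        with open(source_name + '-lex', 'w') as out:
            for tok in lexer:                 # ply lexers are iterable
                name = tok.type.lower()       # e.g. IDENTIFIER -> identifier
                out.write('%d\n%s\n' % (tok.lineno, name))
                if name in HAS_LEXEME:
                    out.write('%s\n' % tok.value)

    serialize(sys.argv[1], lexer)   # writes file.cl-lex for file.cl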
The official list of token names is:
- at case class colon comma divide dot else equals esac false fi
identifier if in inherits integer isvoid larrow lbrace le let loop lparen
lt minus new not of plus pool rarrow rbrace rparen semi string then tilde
times true type while
In general the intended token is evident. For the more exotic names: at
= @, larrow = <-, lbrace = {, le = <=, lparen = (, lt = <, rarrow
= =>, rbrace = }, semi = ;, tilde = ~.
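If you use ply, note that string-style rules are applied
longest-regex-first, which resolves the overlaps among these
operators automatically (a sketch; the uppercase names follow ply
convention and would be lowered during serialization):

    t_LARROW = r'<-'   # tried before t_LT because its regex is longer
    t_LE     = r'<='
    t_LT     = r'<'
    t_RARROW = r'=>'
    t_EQUALS = r'='
    t_AT     = r'@'
    t_TILDE  = r'~'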
The .cl-lex file format is exactly the same as the one generated
by the reference compiler when you specify --lex. In addition,
the reference compiler (and your upcoming PA3 parser!) will read
.cl-lex files instead of .cl files.
Lexical Analyzer Generators
The OCaml
lexical analyzer generator is called ocamllex and it comes
with any OCaml distribution.
Haskell uses the Alex lexical
analyzer generator. It comes with the Haskell Platform.
A Ruby lexical
analyzer generator called ruby-lex is available, but you must
download it yourself.
- An alternate download source is also available. You may have to
rename the archive from .tgz to .tar for it to work correctly.
A JavaScript lexical analyzer
generator called jison is available. You must download it
yourself.
A Python lexical analyzer
generator called ply is available, but you must download it
yourself.
All of these lexical analyzer generators are derived from lex (or
flex), the
original
lexical analyzer generator for C. Thus you may find it handy to
refer to the
Lex paper or the
Flex manual. When you're reading, mentally translate the C code
references into the language of your choice.
My personal opinion is that the OCaml and Python tools are a bit more
mature (i.e., easier to use) than the Ruby and JavaScript tools for this
particular project, but feel free to prove me wrong. In addition, this is
the programming project that will involve the least amount of "native
coding", so if you have a least favorite language of the three you
might consider using it for this project. (Note: Students typically
consider it a mistake, in retrospect, to chose OCaml for PA2 just because
they struggled with it in PA1. OCaml's static types and disjoint unions are
well-suited for PA4 and PA5.)
Commentary
You can do basic testing by running your lexer and the reference
compiler on the same input and comparing the serialized output.
For example, if you used OCaml:
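One plausible workflow (file and module names are illustrative):

    ocamllex lexer.mll                    # generate lexer.ml from your rules
    ocamlc lexer.ml main.ml -o lexer      # build your lexer
    cool --lex good.cl                    # reference output: good.cl-lex
    mv good.cl-lex good.cl-lex.reference
    ./lexer good.cl                       # your output: good.cl-lex
    diff -b -B -E -w good.cl-lex good.cl-lex.reference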
You may find the reference compiler's --unlex option useful for
debugging your .cl-lex files.
Need more testcases? Any Cool file you have (including the one you wrote
for PA1) works fine. The ten in the cool-examples.zip file should
be a good start. There's also one among the PA1 hints. You'll want to make
more complicated test cases -- in particular, you'll want to make
negative testcases (e.g., testcases with malformed string
constants).
What To Turn In For PA2
You must turn in a zip file containing these files:
- readme.txt -- your README file
- good.cl -- a positive testcase
- bad.cl -- a negative testcase
- source_files -- including
- main.rb or
- main.py or
- main.hs (and some_file.x, if applicable) or
- main.js (and some_file.jison, if applicable) or
- main.ml and some_file.mll
If your regular expressions and lexer definition are in some other
file (e.g., lexer.mll, lexer.jison, etc.), be sure to
include them!
Your zip file may also contain:
- team.txt -- an optional file listing only the uva
email address of your other team member (see below -- if you are not
working in a team, do not include this file)
Submit the file as you did for PA1.
Working In Pairs
You may complete this project in a team of two. Teamwork imposes burdens
of communication and coordination, but has the benefits of more thoughtful
designs and cleaner programs. Team programming is also the norm in the
professional world.
Students on a team are expected to participate equally in the effort and to
be thoroughly familiar with all aspects of the joint work. Both members
bear full responsibility for the completion of assignments. Partners turn
in one solution for each programming assignment; each member receives the
same grade for the assignment. If a partnership is not going well, the
teaching assistants will help to negotiate new partnerships. Teams may not
be dissolved in the middle of an assignment.
If you are working in a team, exactly one team member should submit
a PA2 zipfile. That submission should include the file team.txt, a
one-line, one-word flat ASCII text file that contains the email address of
your teammate. Don't include the @virginia.edu bit. Example: If
ph4u and wrw6y are working together, ph4u would
submit ph4u-pa2.zip with a team.txt file that contains
the word wrw6y. Then ph4u and wrw6y will both
receive the same grade for that submission.
This seems picayune, but in the past we've had students fail to correctly
format this one-word file. Thus you now get a point on this
assignment for either formatting this file correctly (i.e., including only
a single word that is equal to your partner's uva email ID) or not
including it (and thus not working in a pair).
Autograding
We will use scripts to run your program on various testcases. The testcases
will come from the good.cl and bad.cl files you and your
classmates submit, as well as held-out testcases used only for grading.
Your programs cannot use any special libraries (aside from the OCaml
unix and str libraries, which are not necessary for this
assignment). We will use (loosely) the following commands to execute them:
- ghc --make -o a.out *.hs ; ./a.out testcase.cl >& testcase.out
- node main.js testcase.cl >& testcase.out
- ocamlc unix.cma str.cma *.ml ; ./a.out testcase.cl >& testcase.out
- python main.py testcase.cl >& testcase.out
- ruby main.rb testcase.cl >& testcase.out
You may thus have as many source files as you like (although two or three
plus your lexer definition should suffice) -- they will be passed to your
language compiler in alphabetical order (if it matters). Note that we
will not run the lexical analyzer generator for you -- you should run it
and produce the appropriate ML, Python, JavaScript or Ruby file and submit
that.
In each case we will then compare your output to the correct answer:
- diff -b -B -E -w testcase.cl-lex correct-answer.cl-lex
If your answer is not the same as the reference answer you get 0
points for that testcase. Otherwise you get 1 point for that testcase.
For error messages and negative testcases we will compare your output
but not the particular error message. Basically, your lexer need
only correctly identify that there is an error on line X. You do not have
to faithfully duplicate our English error messages. Many people choose to
(because it makes testing easier) -- but it's not required.
We will perform the autograding on some unspecified test system. It is
likely to be Solaris/UltraSPARC, Cygwin/x86 or Linux/x86. However, your
submissions must officially be platform-independent (not that hard
with a scripting language). You cannot depend on running on any particular
platform.
There is more to your grade than autograder results. See the Programming
Assignment page for a point breakdown.
Your submission may not create any temporary files. Your submission may not
read or write any files beyond its input and output. We may test your
submission in a special "jail" or "sandbox".