Inf2A 2018 19: Assignment 1
The Language Processing Pipeline for Micro-Haskell
Issued 12 October 2018
The deadline for this assignment is 4pm, Tuesday 30 October 2018 (the night before
Hallowe’en).
Overview
The objective of this practical is to illustrate the language processing pipeline in the
case of a small but interesting fragment of the Haskell programming language, which we
shall call Micro-Haskell (or MH). The practical is self-contained, and prior knowledge of
Haskell itself is not assumed; however, those with no previous experience of Haskell may
need to invest a little more time to understand what is required. The brief introduction
to Micro-Haskell to be given in Lecture 14 may also be helpful.
The practical illustrates all stages of the language processing pipeline for program-
ming languages, taking us from a source le, written in MH, to execution of the program.
In our case, the pipeline has four stages: lexing, parsing, typechecking and evaluation.
Your task is to provide language-speci c material to assist with the rst three stages:
lexing, covered in Part A of the practical; parsing, covered in Part B; and typechecking,
covered in Part C. A simple evaluator is provided for you.
The code implementing all four stages will itself be written in Java. (So you are
warned at the outset that you’ll need to think in two languages at once!) Several
general-purpose components are provided; your task is to supply the parts speci c to
MH, by applying your understanding of the course material. Once you have nished,
you’ll be able to hook up your implementations to the given evaluator to obtain a
complete working implementation of MH.
Of course, many libraries of lexing and parsing tools in Java are available on the
Internet, but the point of this practical is to build things up from rst principles in
order to understand how they work. You should therefore not attempt to use any
lexer or parser utilities you may nd online, nor any tools such as StringTokenizer or
StreamTokenizer that might otherwise appear tempting.
1
Besides writing your own code, you are advised to spend some time understanding
what the provided parts of the code are doing, at least in broad terms. This will help
you to understand what you need to implement, and also to see how the various stages
of the pipeline all t together.
The version of Java for this practical is 1.8.0 181 (the version currently installed on
the DICE machines). If you have another version installed on a machine of your own,
you may still be able to complete the practical with this, but it’s your responsibility
to check that the nal version of your code compiles and runs under the above version
before you submit it.
Instructions
To begin, download the code le Inf2A_Prac1_Files.zip from the Informatics 2A
Assignments webpage. On a DICE machine, this can be unpacked using
unzip Inf2A_Prac1_Files.zip
This will build a subdirectory Inf2A_Prac1_Files inside your working directory (folder),
within which you will nd all source les needed for the assignment. Look rst at the
le MH_example.txt, which provides a small sample of Micro-Haskell.
The assignment comprises four parts: A{D. In each of Parts A{C, you have to
complete a template Java le that is provided for you. In doing this, ll in only the
parts of code that you are instructed to write. Do not make any other alterations
to the existing code | this will invalidate your solution! You are required to
submit your solutions from a DICE machine, using the submit command, as speci ed
below.
Part Submit command Marks
A submit inf2a cw1 MH_Lexer.java 35
B submit inf2a cw1 MH_Parser.java 30
C submit inf2a cw1 MH_Typechecker.java 25
D no further submission required 10
It is also possible to submit your solutions to parts A{C simultaneously:
submit inf2a cw1 MH_Lexer.java MH_Parser.java MH_Typechecker.java
Part D combines Parts A{C with the evaluator to create a full working implementation
of MH; it does not require any submission of its own.
2
Part A: Lexing in Micro-Haskell [35 marks]
In this part, we construct a lexical analyser for MH. A general-purpose longest-match
lexer is already provided. Your task is to supply deterministic nite-state machines that
serve as recognizers for the various lexical classes in the language.
Look at the provided le GenLexer.java. It begins with some Java functions that
de ne certain useful classes of characters: letter;small;large;digit;symbolic;whitespace;
newline. Next comes a Java interface DFA which de nes the functionality that any -
nite state machine has to provide. Some of this is provided in the class Acceptor which
follows, but notice that this class contains stubs for ve ‘abstract’ methods whose imple-
mentation will be speci c to the particular DFA in question. There then follow three ex-
amples of how to construct implementations of particular DFAs: EvenLetterAcceptor,
AndAcceptor and SpaceAcceptor. Later in the le, the class DemoLexer illustrates how
these DFAs may be combined to yield a lexer for a simple (and silly) language, and the
class LexerDemo gives you a simple way of trying this out (the comments in the source
le explain how).
Notice that states are represented by integers, with 0 as the initial state. Besides
the transition operation and the set of accepting states, our DFAs here must also be
equipped with a designated dead (or \garbage") state: that is, a non-accepting state
from which no sequence of transitions can lead to an accepting state. Note also that our
DFAs must include the method String lexClass(), which provides the name of the
lexical class they are associated with. This is done because we wish our lexer to output
a stream of tokens each tagged with their lexical class.
Your objective in Part A is to implement DFAs in the same style. corresponding to
the lexical classes of Micro-Haskell. This is to be done in the le MH_Lexer.java, which
currently provides a template containing some gaps for you to ll in. For the rst six of
these gaps, follow the pattern of the examples in GenLexer.java to construct DFAs for
the following lexical classes de ned by regular expressions. (These correspond closely to
lexical classes of actual Haskell, except that we’ve chosen a slightly di erent de nition
of the class NUM.)
• A class VAR of variables, de ned by
small (small + large + digit+')
• A class NUM of numeric literals, de ned by
0 + nonZeroDigit digit
where nonZeroDigit means what you think it does.
3
• A class BOOLEAN of boolean literals, de ned by
True + False
• A class SYM of symbolic tokens, de ned by
symbolic symbolic
• A class of whitespace elements, de ned by
whitespace whitespace
• A class of comments, de ned by
--- (nonSymbolNewline nonNewline + ")
where nonSymbolNewline is the set of all characters except those of symbolic or
newline, and nonNewline is the set of all characters except those of newline. Note
that --- e ectively means ‘two or more dashes’.
The names of the last two classes, implemented by the lexClass() method, should both
be the empty string. This will notify the lexer that tokens of these classes should be
discarded.
In addition to these classes, keywords such as if and special symbols such as ; will
require ‘singleton’ (i.e. one-element) lexical classes all to themselves. For this purpose,
we will provide a class tokAcceptor which, for any string tok that we supply, can
create a special DFA that accepts the string tok and no other strings. For instance, the
constructor call tokAcceptor("if") should create a DFA that accepts only the string
"if". Fill in the gap in the implementation of this class so as to achieve this behaviour.
(Note that this is quite di erent from the other classes above, in that we will be able to
create several objects of class tokAcceptor each implementing a di erent DFA.) Here
the name of the lexical class should be identical to the string itself | this will serve to
make the speci c token we are dealing with visible to the parser.
The lexical classes we require for MH are now the six lexical classes listed above,
together with singleton classes for the ve keywords Integer, Bool, if, then, else and
for the three special symbols (, ), ;.
Following the example of class DemoLexer in the le GenLexer.java, add a few
lines of code to construct acceptors for these fourteen classes, and put these together
in an array called MH_acceptors. The acceptors should be listed in order of priority,
4
with highest priority rst, which should be sensibly chosen so that keywords like if are
assigned an appropriate lexical class (so as to emulate the behaviour of an actual Haskell
implementation).
The MH_acceptors array is fed to a general-purpose routine that performs longest-
match lexing (also known as maximal munch) using the method described in Lecture
7. Take a brief look at the code for this in GenLexer.java, and check that you broadly
understand what it is doing.
You should now be able to compile GenLexer.java and your le MH_Lexer.java to
create a lexer for MH. To test your lexer, you might wish to adapt the LexerDemo code
in GenLexer.java; this will allow you to create a simple command-line driven lexical
analyser for MH. You are not required to submit this test code, however.
Before leaving the topic of lexing, take a quick glance at the code provided in
CheckedSymbolLexer.java. This performs some mild post-processing on the stream
of lexical tokens: symbolic tokens are checked to ensure that they are among the tokens
that feature in MH:
:: -> = == Type
Type0 ! Integer j Bool j ( Type )
TermDecl ! VAR Args = Exp ;
Args ! j VAR Args
Exp ! Exp0 j if Exp then Exp else Exp
Exp0 ! Exp1 Rest0
Rest0 ! j == Exp1 j = == Type
Look at the le MH_Type_Impl.java, which de nes a Java representation of MH types
in this stripped-down form. The interface MH_TYPE declares various operations one can
perform. on such types (check that you understand what they are intended to do), while
further down, the class MH_Type_Impl provides prede ned constants for the MH types
Integer and Bool, as well as a constructor for building a function type (i.e. an arrow
type) from two previously existing MH types. In the typechecking code you will be
writing, these may be utilized as follows:
MH_Type_Impl.IntegerType ; // AST for Integer
MH_Type_Impl.BoolType; // AST for Bool
new MH_Type_Impl (t1,t2); // AST for (t1->t2)
Clearly, we will need a way to convert syntax trees as produced by the parser
into ASTs of this kind. This is done by the provided methods convertType and
convertType1 in MH_Type_Impl.java. A good warm-up to your own task would be
to try and understand the workings of convertType and convertType1 with the help
of the comments provided.
A similar notion of abstract syntax trees is also required for expressions (trees with
topmost label #Exp). In e ect, ASTs for expressions are just trees for the simpli ed
grammar:
Exp ! VAR j NUM j BOOLEAN j Exp Exp j Exp in x Exp j if Exp then Exp else Exp
where in x ranges ver ==;. Type in an
expression you would like to evaluate, and hit return. The expression may involve the
functions declared in your MH program. Do this as many times as you like; e.g.:
MH> 3+6
...
MH> fib 7
...
MH> gcd 104 (fib 12)
...
To quit, hit CTRL-c.
This last part of the assignment does not require you to do any further coding. Your
solutions to Parts A{C will be combined with the evaluator, and the resulting complete
implementation of MH will be marked for the correctness of its behaviour on a test
suite of ve MH programs (di erent from those in MH_example.txt). However, in order
to gain assurance that your solutions to A{C will indeed combine to yield a correct
implementation of Micro-Haskell, you should spend a little time testing your complete
system on some small MH programs of your own devising (there is a certain amount of
fun to be had from this in any case). You should also be sure to test your system on a
sample of incorrect programs, to ensure that an appropriate error message is raised by
the lexer, parser or typechecker as appropriate.
Support for this assignment
The optional lab sessions in Weeks 5 and 6 will o er support for this assignment: lab
demonstrators will be on hand to help you, particularly with any Java programming
issues.
You may also post queries to the Piazza forum, but we would ask you to do so rather
sparingly: in some previous years, the volume of queries has been overwhelming for sta
and students alike! Please respect the following guidelines:
1. Do not post a query to the forum unless you have already spent a signi cant time
(say 40 minutes) trying to answer it yourself | by studying this handout or the
lecture slides, by searching the Web for relevant help, etc.
11
2. Take care not to post any part of an actual solution | not even one line of code!
3. Feel free to answer queries from other students (whilst observing 2). This is much
appreciated by sta and will often mean the question gets answered more promptly.
4. Don’t post a query without rst checking whether the same query has already
been posted and answered.2
5. Do keep your postings polite and respectful.3
Note on marking
Your submission will be marked by a human marker with the assistance of some au-
totesting programs that will run your code on various test examples. If your code passes
all or most of these tests, the human marker’s job will be easy; but if not, the marker
may need to inspect your code to assess your level of understanding. It is therefore in
your interests to include at least a few comments in your code to show that you know
what you are doing (and commenting your code is good practice in any case).
Most critically, please, please ensure that your code compiles correctly (in conjunc-
tion with the other source les provided) before you submit it! Very few marks will be
available for submissions that do not compile.
Mary Cryan, October 2018
2It’s in everyone’s interests to keep the forum tra c manageable: in previous years, the number of
postings was so daunting as to discourage students from scanning them all | they would then post
duplicate queries and the problem would snowball.
3Even though you are posting anonymously to your classmates, Shay and I can request information
about posters if the facility is abused!