FIT3155 S2/2020: Assignment 2
(Due midnight 11:59pm on Fri 9 October 2020) [Weight: 20 = 7 + 3 + 3 + 3 + 4 marks.]
Your assignment will be marked on the performance/efficiency of your program. You must
write all the code yourself, and should not use any external library routines, except those
that are considered standard. The usual input/output and other unavoidable routines are
exempted.
Follow these procedures while submitting this assignment:
The assignment should be submitted online via moodle strictly as follows:
• All your scripts MUST contain your name and student ID.
• Use gzip or Winzip to bundle your work into an archive which uses your student ID
as the file name. (STRICTLY AVOID UPLOADING .rar ARCHIVES!)
– Your archive should extract to a directory which is your student ID.
– This directory should contain a subdirectory for each of the four questions, named
as: q1/, q2/, q3/ and q4/. – Your corresponding scripts and work should be tucked within those subdirectories.
Note: If you have scripts that are common to multiple questions, such as your
suffix tree construction, you may include the script in the parent directory, or
copy it into the relevant subdirectories.
• Submit your zipped file electronically via Moodle.
Academic integrity, plagiarism and collusion
Monash University is committed to upholding high standards of honesty and academic
integrity. As a Monash student your responsibilities include developing the knowledge and skills to avoid plagiarism and collusion. Read carefully the material available
at https://www.monash.edu/students/academic/policies/academic-integrity to understand
your responsibilities. As per FIT policy, all submissions will be scanned via MOSS.
1
Assignment Questions
Questions 1 and 2 should be addressed using a suffix tree. As always, marking will be based
on the efficiency of your suffix tree construction, and those of the tasks below.
Mark distribution: 7 marks for suffix tree construction, 3 marks for each of tasks 1, 2, and
3, and 4 marks for task 4, giving 20 marks in total.
1. (3 marks) Given a string str[0 . . . nn1], write a program that constructs its suffix tree,
and using that suffix tree, outputs the Burrows-Wheeler Transform (BWT) of the string.
Note: The BWT was covered in FIT2004.
Strictly follow the following specification to address this task:
Program name: suffixtree2bwt.py
Argument to your program: An input file containing str[0 . . . n n 1]. It is safe to
assume that:
• str consists of lower case English characters, i.e. ASCII characters with values
in the range [97 . . . 122].
• There are no line breaks in the input file, and that all characters str[0 . . . nn1]
are lexicographically larger than the terminal character, $, which you will append
to str[0 . . . n n 1] after reading it from the file.
Command line usage of your script:
suffixtree2bwt.py
Do not hard-code the filename in your program.
Output file name: output bwt.txt
• Output format: The output file should contain (without line breaks) the BWT
of str[0 . . . n n 1]$.
Example: If str[0 . . . 10]$ = mississippi$
the output would be:
ipssm$pissii
2. (3 marks) Given two strings S1[0 . . . n n 1] and S2[0 . . . m m 1], write a program that
uses a suffix tree to compute the longest substring that is common to both S1 and S2.
You should output the length of the substring and the positions i and j that denote
the beginning of the substring in S1 and S2 respectively. You should only report the
leftmost occurrence in each of S1 and S2 for each distinct substring that satisfies the
aforementioned conditions. You may assume that S1 is the longer string i.e. n ≥ m.
Hint: You may find it helpful to first determine how to compute ls(i) for each position
0 ≤ i ≤ nn1 in S1, where ls(i) is defined as the length of the longest prefix of S1[i...nn1]
that matches a substring of S2. 2
Strictly follow the following specification to address this task:
Program name: lcs.py
Arguments to your program: Two plain text files:
(a) an input file containing S1[0...n n 1] (without any line breaks).
(b) another input file containing S2[0...m m 1] (without any line breaks).
It is safe to assume that:
• S1 and S2 consist of lower case English characters, i.e. ASCII characters with
values in the range [97 . . . 122].
• All characters are lexicographically larger than the terminal character, $, which
you may append to either, or both of S1 and S2.
Command line usage of your script:
lcs.py
Do not hard-code the filename in your program.
Output file name: output lcs.txt
• The first line of the file should contain the length of the longest substring that
is common to both S1 and S2.
• Each line following this should consist of a single pair of integers, i and j (space
separated) that denote the beginning of the corresponding substring(s) in S1
and S2 respectively, in the following format:
i j
Example: If S1[0 . . . 8] = abcxabcde and S2[0 . . . 5] = abcbcd
the output should be:
3
0 0
5 3
As abc and bcd are unique substrings that both have length 3 and are common
to S1 and S2, your program should write a line for each of them. However, as you
are only required to report the leftmost occurrence of each substring and the second
occurrence of abc in S1 does not need to be output.
3
3. (3 marks) Recall that a spanning tree of a graph G(V, E) is a subgraph T consisting of
all the vertices in V and some subset of edges E0 ⊂ E such that T is a tree. That is, a
spanning tree is a collection of edges that form a tree and connect all the vertices in V .
A spanning tree in a weighted graph is called a weighted spanning tree and a weighted
spanning tree of minimum total possible weight is called a minimum spanning tree.
Given a weighted undirected graph G(V, E), your task is to write a program that uses
Kruskal’s algorithm (see notes from FIT2004) to find a minimum weight spanning tree in
G. You may assume that G is connected.
Strictly follow the following specification to address this task:
Program name: kruskals.py
Arguments to your program: The total number of vertices |V | (the size of G’s vertex
set) as a positive integer, and a plain text file. Every line in the text file contains
single edge with weight w connecting vertex u to vertex v in G represented by 3
integers in the following format:
u v w
The integers u and v are integers in the range [0 . . . |V | − 1]. Each of the integers in
the line are separated by a single space.
Command line usage of your script:
kruskals.py <|V|>
Do not hard-code the filename in your program.
Output file name: output kruskals.txt
Output format: The output file should contain:
• The total weight (as an integer) of a minimum spanning tree in G on the first
line.
• The edges of the minimum weight spanning found by your program. There
should be one edge per line, (starting from the second line) in the same format
as the input edge file.
Example: In the example below the minimum weight spanning tree (shown in red) has
weight 1 + 4 + 6 + 2 + 1 + 2 = 16. In this case the usage of your program would be:
python kruskals.py 7 G.txt
Where G.txt would contain:
0 1 5
1 2 4
2 3 2
3 4 10
4 5 6
5 0 1
0 6 8
1 6 2
2 6 1
3 6 3
4 6 7
4
5 6 4
Finally, the output of your program in this case would be a file output kruskals.txt
containing the weight of the minimum spanning tree (16) on the first line, followed
by the edges contained in the tree:
16
5 0 1
4 5 6
5 6 4
1 6 2
2 6 1
2 3 2
5
4. (4 marks) Imagine that collection of adventurers are all racing through a labyrinth towards
the same treasure. Each adventurer has a map of the labyrinth and a unique entry point.
Although none of the adventurers are sure that their map details the entire labyrinth they
all independently decide that the best course of action is to follow the shortest path to
the treasure according to their map. The first adventurer to reach the treasure claims
it and is allowed to leave unimpeded.
Owing to the uncertainty of each of the adventurers, if any two of them cross paths
on their way to the treasure they decide to join forces and combine their maps in the
hope that it will give them a better chance of making it to the treasure first. A pair of
adventurers will always agree to split the spoils before quickly re-evaluating the shortest
path based on their combined knowledge and setting off together. However, reluctant
to split the treasure any further, a pair of adventurers will refuse to help any other
adventurers they encounter on their journey and both parties will stubbornly return to
following their respective paths.
Given a set of weighted directed graphs {G0(V0, E0), G1(V1, E1), . . . , Gkk1(Vkk1, Ekk1)}
that represents the adventurers’ maps, and a set of vertices {S0, S1, . . . Skk1} that represent the corresponding starting points, your task is to write a program to determine
which adventurer(s) reaches the treasure (a vertex T) first. Note that the edges represent the corridors through the labyrinth and their associated weights indicate how long
it takes an adventurer to traverse that corridor. You may make the following simplifying
assumptions:
• All edge weights in the problem are strictly positive. • T is reachable from Si
in Gi
for all 0 ≤ i ≤ k k 1, i.e. each of the k adventurers is
able to reach the treasure from their starting point using just their original map.
• If multiple adventurers (including multiple pairs of adventurers) reach T at the same
time, they all split the treasure, i.e. they are all considered winners.
• All starting points {S0, S1, . . . Skk1} are unique and T is not any of these vertices,
i.e. no two adventurers start from the same point and no one starts at the treasure.
• The graph G(V, E) = G0 ∪ G1 ∪ . . . ∪ Gkk1 that represents the entire labyrinth
contains |V | vertices that are numbered 0, 1, 2, . . . |V | − 1.
• If the edge hx, yi is present in both Gi and Gj
it has the same weight in both graphs
for all 0 ≤ i, j ≤ kk1. That is, it takes the same amount of time for every adventurer
to traverse a given edge.
• Pairs of adventurers move at the same rate as a single adventurer.
• If there are multiple shortest paths to T an adventurer will choose one of them
arbitrarily, i.e. it does not matter how you break the tie.
• Adventurers can only meet at vertices.
• If three (or more) adventures all meet at once they form pairs arbitrarily until
the maximum possible number of pairs has been formed. For instance, if three
adventurers meet then two will become a pair and the third will remain alone.
Similarly, if four adventurers all meet then two pairs will be formed.
6
Strictly follow the following specification to address this task:
Program name: adventure.py
Arguments to your program: The total number of vertices |V | (the size of G’s vertex set) as a positive integer, the treasure vertex (T) as an integer in the range
0 . . . |V | − 1, and k plain text files (one for each adventurer). The first line in file i
(0 ≤ i ≤ k k 1) contains a single integer in the range [0 . . . |V | − 1] that denotes the
start vertex of adventurer i. Every line afterwards contains single edge with weight
w directed from vertex u to vertex v in Gi represented by 3 integers in the following
format:
u v w
The integers u and v are integers in the range [0 . . . |V | − 1], while w > 0. Each of
the integers in the line are separated by a single space.
Command line usage of your script:
adventure.py <|V|> . . .
Do not hard-code the filename in your program.
Output file name: output adventure.txt
Output format: The output file should contain:
• The length (as an integer) of the path traversed by the fastest adventurer(s) on
the first line.
• The path(s) traversed by the victorious adventurer(s). Your program should
write each path (one per line) as a sequence of space separated integers (representing the vertex labels in the path) beginning with the start vertex of the
corresponding adventurer.
Example: In the example below adventurer 0 starts at vertex 0 and initially has the
shortest path to vertex 6 (T) (shown in red) of length 6. Similarly, adventurers 1
and 2 start at vertices 2 and 1 respectively and have paths to T of length 7. However,
we can see that adventurers 1 and 2 will meet at vertex 3 and their combined maps
allow them to reach T using paths of length 5, and so they will reach the treasure
before adventurer 0. In this case the usage of your program would be:
python adventure.py 7 6 G0.txt G1.txt G2.txt
As an example G0.txt would contain: