首页 > > 详细

辅导INF2B留学生、讲解Java编程语言、Edinburgh讲解、辅导Java设计 辅导R语言编程|讲解SPSS

COURSEWORK 1 FOR INF2B (ADS THREAD, 2018-19)
STRING MATCHING
Submission Deadlines: The coursework consists of two parts (of a different nature) relating to
one problem. As shown below these have separate deadlines, you cannot swap the parts round
or submit either part later than the stated deadline (unless you have permission to do so from the
yeare organiser).
Part 1 consists of Exercise 1 on page 4, §2.1 and Exercise 2 on page 9, §2.2; these involve
fairly straightforward analysis.
Submit Part 1 by: 4.00PM, 25 February 2019.
Part 2 consists of the tasks in §3; these are all software related.
Submit Part 2 by: 4.00PM, 08 March 2019.
See §4 at the end of this document for instructions on how to submit.
§1. Introduction
In this practical we will consider two methods of searching for all occurrences of a given pattern
within some text (both given as strings). For example we might have a long piece of English text
about Scotland and we wish to search for all occurrences of “Edinburgh” within it. Our aim is to
return the offset position of all occurrences, if there are no occurrences then we indicate this by
a special convention discussed below.
The practical has several aims:
1. Practice at asymptotic analysis (with guidance).
2. Careful implementation of one algorithm (an implementation of the other algorithm is
supplied).
3. Carry out timing experiments in order to:
(a) Compare the algorithm that you implement with one that is supplied and decide the
point from which one is more efficient than the other (i.e., the overheads are overweighed
by the advantages).
(b) Determine the constant for the asymptotic analysis as it applies to your particular
implementation.
You can concentrate on Part 1 (up to and including §2) first and then on the rest. However it
would be a good idea to read the entire document through initially (without going into great
detail) so that you know what is expected for the complete coursework. Note that although the
document is fairly long the tasks you are required to carry out are quite modest. The emphasis
here is on understanding the problem well and so plenty of discussion has been included.
1
The practical requires you to submit various things. Each of these has some appropriate
marks assigned. It would however be a serious error to assume that by carrying out only these
tasks you have obtained the most (or indeed the main) benefit from the practical. Throughout
the text you are prompted to think about various matters that are not part of the submission. You
should consider these carefully along with possibly other issues as they occur to you. In other
words avoid the pitfall of just going through the motions in a mechanical fashion; in any case this
is not possible for most parts. The practical is issued to help you develop your understanding of
the material, the marks are a by-product (of course an important one) and far from being its only
rationale.
Notes
Good Scholarly Practice: Please remember the University requirement as regards all assessed
work for credit. Details and advice about this can be found at:
http://web.inf.ed.ac.uk/infweb/admin/policies/academic-misconduct
and links from there. Note that, in particular, you are required to take reasonable measures to
protect your assessed work from unauthorised access. For example, if you put any such work
on a public repository then you must set access permissions appropriately (generally permitting
access only to yourself, or your group in the case of group practicals).
Following instructions: The various parts of this practical place certain requirements with a
penalty if these are not followed. No negotiation of any kind will be entered into where the
requirements are not followed. The point of the requirements is to ensure that in doing the
various parts you practice and demonstrate understanding of the appropriate areas of the course.
What you submit must be considered work; imagine that your work was being given to you and
others as a sample answer. If you, or others, would find it unclear and unhelpful then your work
needs further revision.
Regrettably the various requirements and penalties can appear somewhat censorious and limiting.
This is not the intention, indeed much of what is stated can be seen in the much more
positive light of reassurance that the various tasks have quite simple answers. Please be assured
that plenty of variation is possible even when following the requirements, i.e., there isn’t necessarily
a unique way to do and present things clearly, concisely and correctly.
Obtaining Help: If, after careful study and deep thought, you need help then use the forum for
questions that you will find on the link at:
http://www.inf.ed.ac.uk/teaching/courses/inf2b/coursework/cwk1.html.
Note that this forum is for questions on this practical only. Any questions must be posted here
and not sent by email. The TA for this thread will monitor the forum at various times. You should
allow at least 2 days for a response.
2§2. String matching
Our discussion is based on the chapter on string matching from the book Introduction to Algorithms
by Cormen, Leiserson, Rivest and Stein. We are interested in strings over some fixed
alphabet (which we need not worry about at this stage, e.g., it could be ASCII characters). We
are given some text which is simply a string over the alphabet. We will represent the text by
means of an array T which we will assume is indexed starting with 1 (this keeps some of the operations
more straightforward1). Our convention will be to think of the text as the array T[1 ..n],
i.e., the array has n entries and T[i] gives us the ith character of the text. We are also given a
pattern which is simply another string over our alphabet. We will always represent the pattern
by an array P[1 ..m]. If m>n then the problem is trivial as there can be no occurrences of the
pattern within the text. We will therefore assume throughout that m ? n.
Before going any further let us look at a simple example. Suppose the text is bacababa and the
pattern is bab, so that n = 8 and m = 3. We can think of their representations diagrammatically
as follows:
text T b a c a b a b a
pattern P b a b
Clearly the pattern occurs in the the text starting at position 5 but nowhere else.
We will adopt the following convention. We say that the pattern P occurs in the text with
shift (or offset) s if and only if
P[1 ..m] = T[s + 1 ..s + m]. (1)
The equation is to be understood as stating that P[i] = T[s + i], for 1 ? i ? m. One advantage
of this definition is that it can be expressed as P[t..m] = T[s + t..s + m] where t is the base
address of our arrays (so in this handout we have t = 1 and in Java t = 0). The point is that the
offset is independent of t.
Our problem is to find all shifts s such that equation (1) holds. We will return these shifts as a
list in increasing order. Thus the pattern occurs in the text if and only if the list is not empty. So
for our example we would return a list of one cell containing the number 4, i.e., the shift (note
that the shift is one less than the position at which a pattern occurs in the text since we number
array locations from 1 onwards [what happens if we number them from 0?]). We can think of our
list as a queue, though we only ever add items to it (of course subsequent use of the list would
remove items to find out where the pattern occurs though we will not do this).
§2.1 Naive algorithm
There is an obvious algorithm for our problem. We can think of it in terms of placing the pattern
under the text in successive positions (starting with 1) and testing to see if the pattern matches
the text immediately above it.
1Although you are probably used to starting arrays at 0 it is an important part of your training to be able to follow
and adapt to other conventions that are in use. Demonstrating this ability is part of this assignment
3
Naive-Matcher(T,P)
1. n T.length
2. m P.length
3. create a new empty queue S
4. for s 0 to n
m do
5a. if P[1 ..m] = T[s + 1 ..s + m] then enqueue(s, S)
Of course the test P[1 ..m] = T[s + 1 ..s + m] has to be implemented as a loop, this has not
been shown above for the sake of clarity. Line 5a should be seen as shorthand for the following
code:
5. matches TRUE
6. for i 1 to m do
7. if P[i] 6= T[s + i] then
8. matches FALSE
9. exit for loop
10. if matches then enqueue(s, S)
The lines have been numbered so that this can be slotted into the preceding pseudocode and in
your analysis assume that this has been done so you can refer to the line numbers as given
We have also shown S as a parameter to enqueue for the sake of clarity, if implemented
in Java then we make the appropriate method call. Another point to note is that, as mentioned
above, our convention here is to number arrays starting at 1 whereas Java starts at 0. We will
discuss this below, for now let us not worry about the implementation language.
Exercise 1. For item 1 you must use the O(·) notation along with any appropriate properties as
discussed in Lecture Note 2. You may not use named constants as part of the analysis. For item 2
you could use named constants but in fact this is not necessary, the ?(·) notation enables things to
be expressed simply and well. Finally, note that for full marks you must justify each expression
or claim with reference to the algorithm. Any answers that do not follow these requirements will
be awarded 0.
Let TNM(n, m) denote the worst case runtime of the algorithm Naive-Matcher on inputs of
size n and m respectively.
1. Prove that TNM(n, m) = O((n
m + 1)m). Note that the additive constant +1 is needed
to take care of the case when m = n. [15%]
2. Prove that TNM(n, m) = ?((n
m + 1)m). Thus you must produce inputs of size n,
m that cause the algorithm to have runtime as claimed. You cannot fix either of n or m.
(HINT: There are very simple examples.) [10%]
3. Is it true to say that TNM(n, m) = ?((n
m + 1)m)? Justify your answer briefly. [5%]
Note: The second part of the exercise is a good opportunity to underline the important point
that for many algorithms the worst case runtime for given input size(s) does not happen for all
possible inputs but only for some appropriate ones. As an example here, if we take the text
4to consist entirely of the character a and the pattern to consist of anything starting with the
character b then the runtime is O(n). We saw the same point in connection with the simple linear
search algorithm.
§2.2 The Knuth-Morris-Pratt algorithm
Let us look again at the simple example. In trying to see if there is a match of the pattern with
the text starting at position 1, the first two characters match but the third fails. Our knowledge of
the text can be pictured as:
text T b a ? ...
pattern P b a b
(It is true that we looked at the next character in the text but we want to relate our knowledge
of the text to the pattern, taking the extra character into account does not help in the following
discussion.) This tells us immediately that it is not worth trying a match of the pattern with the
text starting at position 2, since the first character of the pattern fails to match its second character.
On the other hand it is worth trying position 3. This observation (and its generalisation below) is
crucial for the development of the algorithm.
Consider the general situation where we are comparing the pattern against the text starting at
position s+ 1 (recall that s is the shift from position 1 of the text). We have a match exactly up to
position q of the pattern (i.e., position s+q of the text). We allow q = 0 to mean there is no match
at all (this makes sense, q is the number of characters matched). We have found a complete match
of the pattern in the text if and only if q = m otherwise there is a mismatch with the character at
position q + 1 of the pattern. Whatever is the case we now know that T[s+ 1 ..s+q] = P[1 ..q].
Given this knowledge, we can decide just from the pattern if it makes sense to try matching it
with the text starting at position s + 2. We do this by comparing the pattern with itself thus:
a1 a2 a3 a4 ... aq
a1 a2 a3 ... aq1
We see that it is only worth trying a match with the text at the next position if and only if the
match above is successful (this is the same as saying that T[s+2 ..s+q] = P[1 ..q1]).
Suppose
that this fails, i.e., the shift of the pattern does not match with the pattern up to position q. We
can again decide just from the pattern if it is worth trying at the next position. Our diagram is:
a1 a2 a3 a4 ... aq
a1 a2 ... aq2
Here we need T[s + 3 ..s + q] = P[1 ..q
2].
The preceding discussion shows that what we need is the least shift s0 > s such that
T[s0 + 1 ..s + q] = P[1 ..s + q
s0
].
5
We can rephrase this as: the least shift s0 > s such that
T[s0 + 1 ..s + q] = P[1 ..k],
where s0 + k = s + q. We can also write the equation as T[s0 + 1 ..s0 + k] = P[1 ..k] [why?].
The crucial point here is that we can pre-compute these shifts from the pattern alone (recall that,
by assumption, T[s0 + 1 ..s + q] consists of the first q characters of the pattern). An equivalent
interpretation is to find the largest kwe can eliminate s0 altogether and write the equation as
T[s + q
k + 1 ..s + q] = P[1 ..k]. (?)
(It is important to get the indices right but don’t let this disguise the main idea which is quite
simple and intuitive.) Note that such a k always does exist since k = 0 satisfies the condition
(vacuously) except that it might not be the largest. On the other hand the requirement kmeans that the possible values of k are bounded from above and hence there is a maximum.
At this point it might be helpful to recall that our discussion is based on the assumption that
T[s + 1 ..s + q] = P[1 ..q]. It follows that the equation (?) above is equivalent to
P[q
k + 1 ..q] = P[1 ..k]
since T[s + q
k + 1 ..s + q] = P[q
k + 1 ..q]. In terms of a diagram the last displayed
equation states
a1 a2 ... aqk+1
aqk+2
... aq
a1 a2 ... ak
So once we fix q we can find k just from the pattern P.
We define the function ? : {1, 2,...,m} ! {0, 1,...,m
1} by requiring that ?(q) is the
value k as just described. (We do not need to define ?(0) because q = 0 means there was no
match at all and so we will automatically move on. ) Note that the function ? can be represented
by an array indexed from 1 to m and we will use this without further comment. From the
preceding discussion we know that
0 (q) < q
for all q. Make sure that you understand fully the meaning of this function and how it relates to
the discussion above. The following example will help.
The function for our simple pattern bab is given by:
q 1 2 3
(q) 0 0 1
Let us see exactly why these values are correct by looking at them in turn. Our discussion will
refer to T (to help you relate the calculations to their intended use) but remember that this is not
necessary for the calculations.
6(1) : Here T[s + 1 ..s + 1] is the same as the pattern entries P[1 . . 1], i.e., T[s + 1] is the same
character as P[1]. We want the largest k such that P[1 ..k] = T[s0 + 1 ..s0 + k] where
s0 + k = s + 1. Since we require s0 > s this immediately means that s0 = s + 1 and k = 0.
Note that this argument applies to any pattern, we always have ?(1) = 0. It is exactly what
we would expect: we are told that the attempt to match the pattern with the text starting
at position s + 1 failed at position s + 2 (i.e., T[s + 2] 6= P[2]). It follows that the next
position at which we should try a match is at s + 2.
(2) : We have T[s + 1, s + 2] = P[1, 2]. Is it worth trying for a match at position s + 2 of the
text? Well such an attempt would look like:
Clearly this is bound to fail. Thus we set k = 0 and this means that s0 = s + 2, there is no
point in attempting a match at position s + 2 (recall that the next attempted match starts at
position s0 + 1).
(3) : We have T[s + 1, s + 2, s + 3] = P[1, 2, 3]. Is it worth trying for a match at position s + 2
or s + 3 of the text? An attempt at position s + 1 would look like:
b a b
b a
Clearly this would fail (we already knew this from the previous case of course). So now
we ask if it is worth trying for a match at position s + 3. Here the picture is:
b a b
b
Thus it is worth trying this (and from the information we have we cannot rule out a match).
So k = 1 and s0 = s + 2.
It is worth stressing again that although the discussion above has referred to parts of the text
input, all these are simply initial segments of the pattern. For any initial segment (i.e., value of q)
we have a value for ?(q). The references to the text were simply a matter of convenience and
for aiding understanding. As noted above, for a given value of q the value of(q) is simply the
largest k such that
P[q
k + 1 ..q] = P[1 ..k].
Note that for k = 0 both sides of this equation consist of an empty array so the equation holds.
In terms of a picture we require the pattern to look like:
Pattern up to position q a1 ··· aqk
a1 a2 ··· ak
Pattern shifted to position q
k + 1 a1 a2 ··· ak
7
In intuitive terms just think about keeping one copy of the pattern up to position q. We then
place another copy under it but shifted along by one position, ignore the last entry (it falls out
of scope), and check for a match. If not we shift the bottom along by one more place and again
check for a match with the characters directly above. We keep doing this till we either find a
match (in which case k is the number of entries matched) or the bottom copy falls off the right
hand end (in which case k = 0). You should now be able to see that the lengthy discussion above
for the pattern bab can be summarised as:
q 1 2 3
Pattern b a b
Shift 1 b a
Shift 2 b
(q) 0 0 1
Make sure you understand how we can work out ?(q) for each q straight from the array of the
pattern and its shifts. Start with q = 1, how much of the displayed array is relevant? For each q
cover up the irrelevant part, if any, and then look for the maximum number of characters that
match completely on any shift line with the relevant part of the pattern.
The method we have used so far to compute ? is not particularly efficient (as a function of m,
the length of the pattern P). The crucial aspect of the approach is that we can compute the
function very efficiently. Here is the pseudocode (recall that we hold the value of ?(i) in an array
which we will also call ?):
Compute-Prefix-Function(P)
1. m P.length
2. ?[1] 0
3. k 0
4. for q 2 to m do
5. while k > 0 and P[k + 1] 6= P[q] do k [k]
6. if P[k + 1] = P[q] then k k + 1
7. [q] k
8. return
Remember that in this section we assume arrays start at index 1 but of course in Java they start at
index 0 so you will need to make an adjustment from the pseudocode to the Java implementation
of Compute-Prefix-Function, see the note on p. 13. It is by no means obvious that this algorithm
is correct, indeed the runtime is also not obvious. A good way to begin the study of an algorithm
is to simulate it on simple examples. In order to help you with this here are two further examples
of patterns and their functions:
1. P = bbcbbabbc q 1 2 3 4 5 6 7 8 9
?(q) 0 1 0 1 2 0 1 2 3
2. P = ababbabababc
q 1 2 3 4 5 6 7 8 9 10 11 12
(q) 0 0 1 2 0 1 2 3 4 3 4 0
8You should first check that these functions are correct by computing ? by the naive method. Then
simulate Compute-Prefix-Function, at least for the first few values. In fact, the supplied code
(see §3.1) has an implementation of the naive method of computing the prefix function (note
that this does not implement Compute-Prefix-Function). This is in the Matcher class and the
method is:
public static int[] buildPrefixFunctionNaively(String pattern)
Important: This method is supplied to help you in cross checking your work. You must not
use it as part of your implementation since it would invalidate all the timing tests (in effect you
would be implementing an algorithm that is significantly slower than naive matching!).
A proof of correctness of the Compute-Prefix-Function is not easy. Again it would be a
good exercise to try proving the correctness of the algorithm, or more accurately think about
how you might approach such a task (put a limit on the time you spend on this). You can find a
complete proof in the book Introduction to Algorithms by Cormen, Leiserson, Rivest and Stein.
The worst case runtime is ?(m) but a naive analysis does not work because of the presence
of a while loop (on line 5) within the overall for loop. We can derive the bound by using a form
of amortized analysis. We will not go into details for Compute-Prefix-Function because the
approach is similar to the one discussed below for KMP-Matcher.
We are now ready to look at a more efficient string matching algorithm.
KMP-Matcher(T,P)
1. n T.length
2. m P.length
3. Compute-Prefix-Function(P)
4. q 0
5. create new (empty) queue Q
6. for i 1 to n do
7. while q > 0 and P[q + 1] 6= T[i] do q [q]
8. if P[q + 1] = T[i] then q q + 1
9. if q = m then
10. enqueue(Q, i
m)
11. q [q]
12. return Q
The correctness of this algorithm is also not obvious but not so hard to establish given the correctness
of Compute-Prefix-Function. The next exercise will find the worst case runtime. Note
that this is a type of analysis that we do not otherwise cover in the course. It has therefore been
broken down into small and simple steps so that the things you have to do are ones that are common
with the type of analysis we have seen many times. Finally, although the question shows
three main items, the last one concludes the analysis for you and is not a question as such; in
your submission include only answers to the first two items.
Exercise 2. Define the function (s)
as follows: (0)
= 0 while for s > 0 the value of (s)
is
the value of q just after the body of the for loop of line 6 in KMP-Matcher has executed for the
value i = s. Consider an execution of the for loop on line 6 when i = s and count the following:
9
the number of executions of the statement controlled by the while loop on line 7;
the number of executions of the statement controlled by the if on line 8;
the number of executions of the statements on lines 10, 11 (controlled by the if on line 9).
Denote this count by cs. Define
cs = cs + (s).
This is called the amortized cost of the execution of the loop and the function
is called a
potential.
1. Let c be the total cost of executing the for loop in KMP-Matcher, counted in terms of
the statements listed above, and c the total amortized cost (i.e., c = c1 + ··· + cn and
c = c1 + ··· + cn). Prove that c = c + (n)

(0).
Explain briefly why it now follows
that c c. [15%]
2. Prove that cs = O(1) for all s (indeed cs 4). Break this down into the following easy
parts:
(a) Let q0 be the value of q before the while loop on line 7 is executed (i.e., q0 = (s1))
and let q1 be the value of q just after the while loop terminates. Let ws be the number
of times that the while loop body is executed. Prove that ws+q1q0
0. Recall that
(q) < q for all q; what happens to the value of q with each execution of the loop?
(b) Let q2 be the value of q after line 8 is executed. Prove that 1 + q2
q1 2.
(c) Let q3 the value of q just after the if statement starting on line 9 is executed (i.e.,
q3 = (s)).
Prove that 2 + q3
q2 2.
(d) Now put these observations together to deduce the overall claim. Note that cs ?
ws +1+2, thus
cs (ws + 1 + 2) + (s).
NOTE: Each of these requires no more than a few lines to justify. So think carefully and
express your justification very clearly. Overlong or unclear answers will receive very much
reduced credit and any answers that cannot be understood after 2 minutes (per part) will
be awarded 0. [20%]
3. We now use the preceding two main parts to deduce that the worst case runtime of the
algorithm is O(n); there is nothing for you to submit here but you should read through this
conclusion of the analysis and make sure you understand it. We use without proof the fact
that the worst case runtime of Compute-Prefix-Function is ?(m); recall that m n.
Lines 1, 2, 4 and 12 of KMP-Matcher all cost O(1) while line 3 is O(m). The cost
of executing the for loop on line 6 for i = s is at most acs + b for some constants a
and b; if you have time give a brief justification of this observation. It follows that the
10overall cost of executing the for loop on line 6 is at most ac + bn and hence at most
ac + bn. From the preceding part we deduce that ac + bn = O(n). Thus the total cost is
O(1) + O(1) + O(1) + O(1) + O(m) + O(n) = O(n) since m n.
In fact, we can improve our analysis to show that the worst case runtime is (n) but we
will not do this here (would you expect it to be sub-linear).
§3. Software tasks
We first give a guide to some of the suppled classes and methods (some more are mentioned
later).
§3.1 Supplied software
The files that you will need for this coursework can be found on-line at the coursework webpage:
http://www.inf.ed.ac.uk/teaching/courses/inf2b/coursework/cwk1.html
This page contains a file called inf2bcw1.tar which you should download. The file is tarred
so you need to extract it, e.g. by tar -xf inf2bcw1.tar, this will create a subdirectory
src of your current one that contains the software. The java package for this coursework is
package matcher. The only class that should be changed is the nearly empty class in the file
StudentClass.java which is where you will implement your part of the project. In fact,
the amount of code you need to write is fairly modest.
Recall that in the discussion above we assumed that arrays were indexed starting from 1.
However Java starts with 0. Rather than invent a new class in Java it is simplest to subtract 1
from all array indices. This is very important otherwise your software will not work.
Note: It is a part of the exercise that you translate the description of the algorithm based on
arrays starting at 1 to arrays starting at 0 in Java. As mentioned above, this is a deliberate choice
so that you practice switching conventions. In this case it is a very simple matter. You cannot
throw away location 0 of Java arrays and start using them from 1 onwards.
Important: Do not remove any headers from the supplied software including those in the
StudentClass.java file.
In the software the text and pattern will be represented by String variables text and
pattern; our alphabet is just any valid Java character (in experiments we will use a restricted
alphabet, this is discussed below). The following methods and classes are already supplied for
you.
class Queue. This implements the queue we use to hold the offsets for occurrences of the
text. In fact, for this practical we only need two methods.
– public Queue enqueue(Integer offset)
This puts s at the end of the queue.
11
– public Queue toString()
This simply returns a comma separated string of all the entries in the queue so that
you can print it out for checking your implementation. If the queue is empty then the
empty string is returned.
Class Matcher that contains various methods including
– public Queue matchNaively(String text, String pattern)
This takes text and pattern and uses NaiveMatcher to return a Queue structure
which consists of all the shifts in the text at which the pattern can be found. For example
if our text is "aababaab" and the pattern is "aa" then the returned Queue structure
has 0, 5 as its elements in this order. With the same text but with the pattern "aaa" the
returned Queue is empty.
§3.2 Tasks
Note: You are asked to write some code that is in fact fairly simple thanks to the supplied
methods. Any code that is incorrect will be awarded 0. Correct code will be marked for style
as well. This means that it should have useful comments and helpful indentation. Note that
pointless comments (e.g., stating the obvious such as “now we add the two variables together”)
or excessive indentation are almost as bad as the complete absence of either.
Important: The checking of your code includes an automated procedure. Partly for this reason,
you must follow the specification exactly and must not change any of the method names or their
parameters. Otherwise your code will fail the test and be awarded 0. In particular remember not
to remove any headers including those in the StudentClass.java file.
The “texts” that we will be searching are in fact DNA sequences. A DNA sequence or genetic
sequence is a succession of letters representing the primary structure of a real or hypothetical
DNA molecule or strand, with the capacity to carry information as described by molecular biology.
The possible letters are A, C, G, and T (note they are all in upper case) representing the four
nucleotide bases of a DNA strand: adenine, cytosine, guanine and thymine that are linked to a
phosphodiester backbone. Sequences can be derived from the biological raw material through a
process called DNA sequencing. This description is purely for background, you do not need to
know it for the exercise, all that is needed is that our alphabet consists of A, C, G, T (and indeed
this is only relevant to carrying out tests).
Your programming tasks are as follows.
1. In the StudentCode class, provide implementations of the method
public static int[] computePrefixFunction(String pattern) [10%]
In the StudentCode inner class KMPMatcher, implement the string matching algorithm:
public void search() [15%]
12Note: Recall that the function
. In
§
2 we used the
convention that arrays start at index 1. However in Java they start at index 0. Since we
are just computing the array that defines
we now reduce each argument to the abstract
function
by 1. To be clear if pi is the Java array that gives us the values of the function
then
is given by pi[i-1]. If we were implementing a Java method getPi, say, then
we could hide this so that a call to getPi(i) looks up pi[i-1] and returns this as the
value. However to avoid extra coding we just work directly with the array pi
.
Keep your code straightforward, follow the algorithms as outlined in
§2.2 (taking care
of the issue regarding array indexing). Note that in designing the matching algorithm we
made the reasonable assumption that the pattern is not longer than the text. Your code must
test for this and return an empty queue if the pattern it strictly longer than the text. Remember
also that you must not use the supplied method buildPrefixFunctionNaively
as part of your implementation and if you do then you will be awarded 0 for the entire
coding part
.
Test your implementation by using the following methods in the Matcher class: public static Boolean testPrefixFunction(String pattern)
This checks for occurrences of pattern with itself as discussed for building the
prefix function. Returns true if the list of occurrences (represented as shifts) is
correct otherwise false. You will not get any other information.
public static void testKMPMatcher(int t, int l)
Generates
t random patterns of length
l and finds their occurrences in a fixed text
using the method KMPMatcher.find. If all results are correct, then it returns an
appropriate message. Else, it reports on the failure. Note that the integer parameters
here and elsewhere must not be negative (an exception is
联系我们 - QQ: 99515681 微信:codinghelp
程序辅导网!