General Specifications
1. Use Python in this homework.
2. Make sure to break down your program into classes/methods/
functions corresponding to the parts in this specification. They
will be tested separately.
3. The function signatures in this specification are informal; their
purpose is to explain the inputs and outputs of the methods.
4. Very important: At certain points, the assignment may be
underspecified - this is by design. In those cases, make
your own choices and assumptions and be prepared to
defend them.
5. Name Part A as partA.py and Part B as partB.py.
Part A: Word Frequencies (40 points)
• Method/Function: List tokenize(TextFilePath)
Write a method/function that reads in a text file and returns a
list of the tokens in that file. For the purposes of this project, a
token is a sequence of alphanumeric characters, independent
of capitalization (so Apple and apple are the same token).
• Method: Map computeWordFrequencies(List)
Write another method/function that counts the number of occurrences of each token
in the token list.
• Method: void print(Frequencies)
Finally, write a method that prints out the word frequency counts onto the screen.
The printout should be ordered by decreasing frequency (so, highest-frequency
words first).
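The three methods above could be sketched in Python roughly as follows. This is only one possible reading of the specification, not a reference solution. The snake_case names, the helper format_frequencies, the ASCII-only token pattern, the alphabetical tie-breaking, and the tab-separated output are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text_file_path):
    """Return the lowercased tokens of a file, in order of appearance.

    Runs in O(n) time for a file of n characters.
    """
    tokens = []
    # errors="ignore" drops undecodable bytes instead of raising an exception.
    with open(text_file_path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            tokens.extend(match.lower() for match in TOKEN_PATTERN.findall(line))
    return tokens


def compute_word_frequencies(tokens):
    """Map each token to its number of occurrences. O(n) for n tokens."""
    return Counter(tokens)


def format_frequencies(frequencies):
    """Return 'token<TAB>count' lines, highest count first.

    O(m log m) for m distinct tokens because of the sort.
    Assumption: ties are broken alphabetically.
    """
    ordered = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [f"{token}\t{count}" for token, count in ordered]


def print_frequencies(frequencies):
    """Print the formatted frequency lines to the screen."""
    for line in format_frequencies(frequencies):
        print(line)
```

Keeping the formatting step separate from the printing step is a design choice that makes the ordering easy to test without capturing screen output.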
The TA will use their own test text files. For this part, it is expected
that your program will read this text file, tokenize it, count the
tokens, and print out the token (word) frequencies. Your program
must run from the command line: write a program that takes one
text file as an argument and outputs the token frequencies.
Please use one of the following output formats when you print out the
result (each symbol below is the separator between a token and its frequency):
\t
-
=
>
->
=>
Part B: Intersection of two files (60 points)
Write a program that takes two text files as arguments and outputs
the number of tokens they have in common. Here is an example of
input/output:
• You can reuse the code you wrote for part A.
• The TA will use their own text files. Note that some of the text
files may be VERY LARGE.
• For this part, programs that perform better will be given more
credit than those that perform poorly.
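One way to approach Part B on very large files is to stream both files and keep only sets of distinct tokens in memory. The sketch below assumes that "tokens in common" means distinct tokens appearing in both files, and that tokens are ASCII alphanumeric runs. The names iter_tokens, unique_tokens, and count_common_tokens are hypothetical.

```python
import re

# Assumption: a token is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def iter_tokens(text_file_path):
    """Yield lowercased tokens one at a time, reading the file line by line.

    O(n) time for n characters; only one line is held in memory at a time.
    """
    with open(text_file_path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            for match in TOKEN_PATTERN.findall(line):
                yield match.lower()


def unique_tokens(text_file_path):
    """Return the set of distinct tokens in a file. O(n) time."""
    return set(iter_tokens(text_file_path))


def count_common_tokens(first_path, second_path):
    """Count the distinct tokens that occur in both files.

    O(n1 + n2) expected time, since set membership checks are O(1) on average.
    Only the first file's token set and the common tokens are stored.
    """
    first = unique_tokens(first_path)
    common = set()
    for token in iter_tokens(second_path):
        if token in first:
            common.add(token)
    return len(common)
```

If one file is known to be much smaller, passing it as first_path keeps the stored set small. Whether this matters in practice depends on the test files.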
Common Tasks
• For both part A and part B, please add a brief runtime
complexity explanation for your code as a comment on top of
each method or function (does it run in linear time relative to
the size of the input? Polynomial time? Exponential
time?). This explanation and your code's actual
conformance with this explanation will be the basis for
evaluating the performance of your program.
• You should get the file names from command line
arguments. Do not hard-code the input file names in your
code or read them from system standard input (stdin). As the
assignment will be graded using an automatic grader, not
doing this will result in losing all credit for the
assignment.
• Exception handling is required for bad inputs. An example
of bad input would be a character in a non-English language.
Your code should be able to tokenize the whole input file even
though there may be some bad inputs in it. You should be
able to skip the bad input and continue with the rest. If your
code throws an exception in the middle of tokenizing a TA's
input test case, you will lose all credit for that test
case.
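The command-line and bad-input requirements above might be handled along the following lines. This is a minimal sketch under stated assumptions: the helper names parse_paths and read_lines_skipping_bad_bytes are hypothetical, files are assumed to be UTF-8, and "skipping bad input" is interpreted as silently dropping bytes that cannot be decoded. How to treat a missing file is left open by the specification; here it is reported on stderr and treated as empty.

```python
import sys


def parse_paths(argv, expected_count):
    """Return exactly expected_count file paths taken from argv[1:].

    O(1) in the size of the input files. Exits with a usage message
    when the number of arguments is wrong.
    """
    paths = argv[1:]
    if len(paths) != expected_count:
        raise SystemExit(
            f"usage: python {argv[0]} <{expected_count} text file path(s)>"
        )
    return paths


def read_lines_skipping_bad_bytes(text_file_path):
    """Yield the lines of a file without raising on undecodable bytes.

    O(n) time for n characters.
    """
    try:
        # errors="ignore" drops bytes that are not valid UTF-8.
        with open(text_file_path, encoding="utf-8", errors="ignore") as handle:
            yield from handle
    except OSError as error:
        # Assumption: an unreadable file is reported and yields nothing more.
        print(f"cannot read {text_file_path}: {error}", file=sys.stderr)
```

In partA.py this could be called as parse_paths(sys.argv, 1), and in partB.py as parse_paths(sys.argv, 2), so the file names always come from the command line.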
Evaluation Criteria
Your assignment will be graded on the following three criteria.
1. Correctness (40%)
1. How well does the behavior of the program match the
specification?
2. How does your program handle bad input?
2. Understanding (30%)
1. Do you demonstrate an understanding of the code?
3. Efficiency (30%)
1. How quickly does the program work on large inputs?