

General Specifications

1. Use Python in this homework.
2. Make sure to break down your program into classes/methods/
functions corresponding to the parts in this specification. They 
will be tested separately.
3. The function signatures in this specification are informal; their 
purpose is to explain the inputs and outputs of the methods.
4. Very important: At certain points, the assignment may be 
underspecified - this is by design. In those cases, make 
your own choices and assumptions and be prepared to 
defend them. 
5. Name Part A as partA.py and Part B as partB.py.
Part A: Word Frequencies (40 points)
• Method/Function: List tokenize(TextFilePath)
Write a method/function that reads in a text file and returns a 
list of the tokens in that file. For the purposes of this project, a 
token is a sequence of alphanumeric characters, independent 
of capitalization (so Apple, apple are the same token).
• Method: Map computeWordFrequencies(List)
Write another method/function that counts the number of occurrences of each token 
in the token list.
• Method: void print(Frequencies)
Finally, write a method that prints the word frequency counts to the screen. 
The printout should be ordered by decreasing frequency (highest-frequency 
words first).
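The three functions above can be sketched as follows. This is only one possible reading of the specification. The snake_case names, the ASCII-only definition of "alphanumeric", the errors="replace" decoding choice, the alphabetical tie-break, and the tab separator are all assumptions, not requirements stated in this assignment.

```python
import re
from collections import Counter

# Assumption: "alphanumeric" means ASCII letters and digits only.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text_file_path):
    """Return the list of lowercase alphanumeric tokens in the file.

    Runs in linear time relative to the size of the file: every line
    is read once and scanned once by the regular expression.
    """
    tokens = []
    # Assumption: undecodable bytes are replaced (and so act as token
    # separators) instead of raising UnicodeDecodeError.
    with open(text_file_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            tokens.extend(t.lower() for t in TOKEN_PATTERN.findall(line))
    return tokens


def compute_word_frequencies(tokens):
    """Map each token to its number of occurrences.

    Linear in the number of tokens (average-case hash table updates).
    """
    return Counter(tokens)


def print_frequencies(frequencies, out=None):
    """Print one "token<TAB>count" line per token, highest count first.

    O(n log n) in the number of distinct tokens because of the sort.
    Assumption: ties are broken alphabetically by token.
    """
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    for token, count in ordered:
        # file=None makes print() fall back to sys.stdout.
        print(f"{token}\t{count}", file=out)
```

The optional out parameter is there only so the printing step can be tested separately from the other two functions; calling print_frequencies(frequencies) writes to the screen.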
The TA will use their own test text files. For this part, it is expected 
that your program will read this text file, tokenize it, count the 
tokens, and print out the token (word) frequencies. Your program 
must run from the command line: write a program that takes one 
text file as an argument and outputs the token frequencies.
Please use one of the following output formats when you print out the 
result (each is the separator placed between a token and its frequency): 
\t
-
=
>
->
=>
 
Part B: Intersection of two files (60 points)
Write a program that takes two text files as arguments and outputs 
the number of tokens they have in common. Here is an example of 
input/output:
• You can reuse the code you wrote for part A.
• The TA will use their own text files. Note that some of the text 
files may be VERY LARGE.
• For this part, programs that perform better will be given more 
credit than those that perform poorly.
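One way to approach Part B is to build a set of distinct tokens for each file and intersect the two sets. The following is a minimal sketch, assuming that "tokens in common" means distinct tokens that appear in both files and that the Part A token definition applies. The helper names unique_tokens and count_common_tokens are hypothetical. Reading line by line keeps memory proportional to the number of distinct tokens rather than to the file size, which may matter for very large inputs.

```python
import re

# Assumption: same token definition as Part A (ASCII letters and digits).
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def unique_tokens(text_file_path):
    """Return the set of distinct lowercase tokens in the file.

    Linear time relative to the size of the file; only the distinct
    tokens are kept in memory, not the whole token list.
    """
    seen = set()
    # Assumption: undecodable bytes are replaced instead of raising.
    with open(text_file_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            seen.update(t.lower() for t in TOKEN_PATTERN.findall(line))
    return seen


def count_common_tokens(path_a, path_b):
    """Return how many distinct tokens the two files share.

    Linear in the combined size of both files: two linear scans plus
    a set intersection that is linear in the smaller set on average.
    """
    return len(unique_tokens(path_a) & unique_tokens(path_b))
```

A list-based comparison that checks every token of one file against every token of the other would be quadratic, which is the kind of difference the performance criterion is likely to expose.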
Common Tasks
• For both part A and part B, please add a brief runtime 
complexity explanation for your code as a comment on top of 
each method or function (does it run in linear time relative to 
the size of the input? Polynomial time? Exponential 
time?). This explanation and your code's actual 
conformance with this explanation will be the basis for 
evaluating the performance of your program.
• You should get the file names from command line 
arguments. Do not hard code the input file names in your 
code or read them from system standard input (stdin). As the 
assignment will be graded using an automatic grader, not 
doing this will result in losing the whole credit for the 
assignment.
• Exception handling is required for bad inputs. An example 
of bad input would be a character in a non-English language. 
Your code should be able to tokenize the whole input file even 
though there may be some bad inputs in it. You should be 
able to skip the bad input and continue with the rest. If your 
code throws an exception in the middle of tokenizing a TA's 
input test case, you will lose the whole credit for that test 
case.
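The requirements above on command-line arguments and bad input can be sketched together. This is an illustration under assumptions, not the grader's expected behavior: main, read_lines_safely, the usage message, and the exit codes are hypothetical, and treating undecodable bytes as skippable via errors="replace" is only one interpretation of "bad input". The line count stands in for the real work of tokenizing and counting.

```python
import sys


def read_lines_safely(text_file_path):
    """Yield the lines of a file without failing on undecodable bytes.

    Linear time relative to the size of the file. Bytes that are not
    valid UTF-8 are replaced with U+FFFD, so reading continues past them.
    """
    with open(text_file_path, encoding="utf-8", errors="replace") as f:
        for line in f:
            yield line


def main(argv):
    """Run on the file named in argv[1] and return a process exit code.

    argv mirrors sys.argv; the file name is never hard coded and never
    read from stdin.
    """
    if len(argv) != 2:
        print("usage: python partA.py <text-file>", file=sys.stderr)
        return 2
    try:
        # Placeholder for the real work (tokenize, count, print).
        line_count = sum(1 for _ in read_lines_safely(argv[1]))
    except OSError as err:
        # Missing or unreadable file: report it instead of crashing.
        print(f"cannot read {argv[1]}: {err}", file=sys.stderr)
        return 1
    print(line_count)
    return 0
```

In a script, this would typically be invoked as sys.exit(main(sys.argv)) under the usual if __name__ == "__main__": guard.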
Evaluation Criteria
Your assignment will be graded on the following three criteria.
1. Correctness (40%)
   1. How well does the behavior of the program match the specification?
   2. How does your program handle bad input?
2. Understanding (30%)
   1. Do you demonstrate an understanding of the code?
3. Efficiency (30%)
   1. How quickly does the program work on large inputs?