7. We have already (essentially) discussed how to compute the tf values. Ok, almost. We talked about
how to count the “raw frequencies” (number of occurrences of each word) for a set of words in one
text file. To convert those to tf values, we would want to divide those raw frequency values by the
total number of words in the file. How do you perform that task?
This is why I suggest using an appropriate HashMap to store these tf values, one for each file.
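The division step above can be sketched as follows. This is only one possible shape; the class and method names (TfSketch, toTf) are my own illustrations, not part of any required design:

```java
import java.util.HashMap;
import java.util.Map;

public class TfSketch {
    // Convert raw word counts into tf values by dividing each count
    // by the total number of words in the file.
    public static HashMap<String, Double> toTf(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;                       // total number of words in the file
        }
        HashMap<String, Double> tf = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            tf.put(e.getKey(), e.getValue() / (double) total);
        }
        return tf;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 3);
        counts.put("cat", 1);
        System.out.println(toTf(counts));     // tf("the") = 0.75, tf("cat") = 0.25
    }
}
```

Note the cast to double: dividing two ints in Java performs integer division, which would give 0 for every term.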
8. Once you find the tf values for each file, how are you going to compute the idf values? Once again,
this is (more or less) the same as computing frequency counts in a file, except that instead of a file,
you have a collection of (previously) computed word frequencies for each text file.
For each term (word) t that appears in *any* of the text files, how do you count how many files that
term appears in? If you can figure out how to do that, you are nearly done computing the idf values.
How do you determine n, the number of documents in the collection we consider?
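One possible approach to the questions above, assuming you keep one tf map per file in a list (again, the names here are illustrative, not required):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdfSketch {
    // For each term appearing in any file, count how many files contain it
    // (the document frequency df), then set idf(t) = log10(n / df(t)),
    // where n is the number of files supplied.
    public static HashMap<String, Double> computeIdf(List<Map<String, Double>> tfPerFile) {
        int n = tfPerFile.size();
        HashMap<String, Integer> df = new HashMap<>();
        for (Map<String, Double> fileTf : tfPerFile) {
            // Iterating over the key set means each file counts at most once per term.
            for (String term : fileTf.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        HashMap<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            idf.put(e.getKey(), Math.log10((double) n / e.getValue()));
        }
        return idf;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> files = new ArrayList<>();
        Map<String, Double> a = new HashMap<>();
        a.put("holmes", 0.1);
        a.put("the", 0.2);
        Map<String, Double> b = new HashMap<>();
        b.put("the", 0.3);
        files.add(a);
        files.add(b);
        System.out.println(computeIdf(files)); // "the" is in both files, so its idf is 0
    }
}
```

Note that n falls out for free here: it is just the size of the list, i.e. the number of files named on the command line.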
9. Note that the logarithm in the definition of idf is the “base 10 logarithm”. You can get this function
in Java with the method Math.log10(x) for a positive number x.
(Actually, it doesn’t really matter much what base you use for the logarithm, since logarithms with
different bases differ by a constant factor. Use “base 10” so it’s easy to compare answers with mine.)
10. Finally, given the tf values for each document and the idf values for each word (over the set of
documents), how do you combine these to compute tfidf values for each word, in each document?
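The combination is just a per-term multiplication. A minimal sketch (names are mine, not required):

```java
import java.util.HashMap;
import java.util.Map;

public class TfidfSketch {
    // tfidf(t, d) = tf(t, d) * idf(t): multiply each term's tf value in a
    // document by that term's idf value over the whole collection.
    public static HashMap<String, Double> combine(Map<String, Double> tf, Map<String, Double> idf) {
        HashMap<String, Double> tfidf = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            // Every term in a document appears in the idf map (it was built
            // from all files), so getOrDefault is just a defensive fallback.
            tfidf.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        }
        return tfidf;
    }
}
```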
11. Keep in mind the special case when there is only one text file (document) supplied to your program.
What do you do then? (Again, I have already answered this question. See the “Note” above.)
12. Requirement: I want you to print out the word with highest tfidf value for each of the text documents.
If there are several such words, print out any one of those words with the highest tfidf value.
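Finding such a word amounts to scanning the map for a maximal entry. One way to do it (illustrative names again):

```java
import java.util.HashMap;
import java.util.Map;

public class MaxEntrySketch {
    // Return one entry with the largest value. If several entries tie for
    // the maximum, whichever is met first during iteration wins, which
    // satisfies the "print any one of those words" requirement.
    public static Map.Entry<String, Double> maxEntry(Map<String, Double> tfidf) {
        Map.Entry<String, Double> best = null;
        for (Map.Entry<String, Double> e : tfidf.entrySet()) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }
}
```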
Sample Output
You can find the files that I used on the module website. Look for the file called “Samples.zip”, download,
and unzip that file. The files included there are:
Chapter-4-Frank.txt (Chapter 4 of Frankenstein, or The Modern Prometheus by Mary Shelley)
Adv-3.txt (“A Case of Identity”, included in The Adventures of Sherlock Holmes by Sir Arthur Conan
Doyle)
Dunwich.txt (The Dunwich Horror by H.P. Lovecraft)
time.txt (The Time Machine by H.G. Wells)
Note: These texts are not “perfect”, in the sense that there could be misspellings present, or words that
run together because spaces are missing, etc. That doesn’t matter for our purposes here. Just use the texts as
given.
All of these files were downloaded from Project Gutenberg. (You should really check out this website if
you do not know about it. Lots of classic books available for free.) Having said that, I am also obligated to
say the following:
These eBooks are for the use of anyone anywhere in the United States and most other parts of
the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away
or re-use it under the terms of the Project Gutenberg License included with these eBooks or
online at www.gutenberg.org. If you are not located in the United States, you’ll have to check
the laws of the country where you are located before using these eBooks.
The Project Gutenberg license is available here.
$ java TFIDF Adv-3.txt Dunwich.txt Chapter-4-Frank.txt time.txt
Max TFIDF value for each file.
==========
Adv-3.txt
==========
holmes 0.003912241785716382
==========
Dunwich.txt
==========
whateley 0.002523302562145693
==========
Chapter-4-Frank.txt
==========
feelings 0.0014205111867745869
==========
time.txt
==========
weena 9.886042550541253E-4
Note that the tfidf values have picked out the name (or surname) of a main character in three out of the
four documents, which makes sense as those names appear many times in each of their respective texts, but
in none of the others (so their idf values are high). (If you’re familiar with these works, you recognize those
names. . . )
$ java TFIDF Adv-3.txt Dunwich.txt Chapter-4-Frank.txt
Max TFIDF value for each file.
==========
Adv-3.txt
==========
holmes 0.0031003782620574196
==========
Dunwich.txt
==========
whateley 0.001999669969487269
==========
Chapter-4-Frank.txt
==========
pursuit 0.001313349895020699
It’s interesting (to me at least) how removing one of the files changes which word has the highest tfidf
value in one of the other files (obviously the idf values of words have changed, but the tf values haven’t in
each document).
$ java TFIDF Chapter-4-Frank.txt
Max TF value for this file.
==========
Chapter-4-Frank.txt
==========
the 0.053480141565080616
As stated, for a single document, we output a tf value instead of a tfidf value, since the idf value would
always be zero. In this case, the highest tf value picks out a rather common word in the document. (The same
word “the” will be selected for each of the individual files, i.e. it has the highest frequency in each single
document. Hence its idf value (and tfidf value) is 0 for any set of more than one of these documents.)
$ java TFIDF when.txt Chapter-4-Frank.txt
Max TFIDF value for each file.
==========
when.txt
==========
necessary 0.012542916485999216
==========
Chapter-4-Frank.txt
==========
i 0.012192721019815203
The comparison of a short document and a long one gives a slightly odd result. The word “I” is contained
in one of them and not the other, so it is the word with the highest tfidf value in that document. Its tf value is
large in “Chapter-4-Frank.txt” and its idf value is non-zero since “I” doesn’t appear in “when.txt”.
A note on testing
Make some small files on which you test your application program. A single file means you are (or should
be) calculating and showing tf values. So it’s easy to make a small file, and check that the term frequencies
are being calculated correctly. Similarly, for two (small) files, it’s relatively easy to check that the tfidf
values are correct. Do those calculations by hand to verify they are correct!
If tfidf is a HashMap variable in a Java program, then this Java statement will print out the entire
HashMap, which will give you large output if the HashMap is big.
HashMap<String, Double> tfidf = new HashMap<>();
...
...
System.out.println(tfidf);
The toString method of the HashMap class has been overridden to give this output. But you can print
out the whole HashMap for small examples to verify the tfidf values are correct in those cases.
Here’s a small(ish) example (where I have added some line breaks to the output to fit it onto the page).
We can see that terms that appear in both documents have a tfidf value of 0 (because the idf value is 0).
$ java TFIDF when.txt fear.txt
Max TFIDF value for each file.
==========
when.txt
==========
necessary 0.012542916485999216
{which=0.0, necessary=0.012542916485999216, in=0.012542916485999216,
one=0.012542916485999216, another=0.012542916485999216,
for=0.012542916485999216, political=0.012542916485999216,
them=0.012542916485999216, it=0.012542916485999216,
bands=0.012542916485999216, when=0.012542916485999216,
people=0.012542916485999216, becomes=0.012542916485999216, the=0.0,
connected=0.012542916485999216, with=0.012542916485999216, f=0.0,
dissolve=0.012542916485999216, have=0.0, course=0.012542916485999216, to=0.0,
human=0.012542916485999216, events=0.012542916485999216}
==========
fear.txt
==========
fear 0.0177076468037636
{needed=0.0088538234018818, unreasoning=0.0088538234018818,
convert=0.0088538234018818, unjustified=0.0088538234018818, we=0.0088538234018818,
advance=0.0088538234018818, firm=0.0088538234018818, that=0.0088538234018818,
into=0.0088538234018818, assert=0.0088538234018818, f=0.0,
me=0.0088538234018818, nly=0.0088538234018818, have=0.0, let=0.0088538234018818,
nameless=0.0088538234018818, so=0.0088538234018818, belief=0.0088538234018818,
fear=0.0177076468037636, all=0.0088538234018818, which=0.0,
terror=0.0088538234018818, paralyzes=0.0088538234018818, is=0.0088538234018818,
my=0.0088538234018818, the=0.0, retreat=0.0088538234018818,
itself=0.0088538234018818, efforts=0.0088538234018818, to=0.0,
thing=0.0088538234018818, first=0.0088538234018818}
A question to consider
You don’t need to write any code for this question, just comment on what I am asking here.
If we are considering only a single document (or file in our case), as stated the idf value doesn’t make any
sense, because it will always be 0 for any word in the document. (There is a single document, so n = 1, and
any word in that document obviously appears in the document, so idf(t, d1, D) = log(1/1) = log(1) = 0.)
So how could we proceed in this case? Instead of tfidf values, I suggested considering tf values alone,
but then words with high tf values are likely to be non-interesting words like “and”, “the”, and “or”. (See
the above example where I had only one file, and the word “the” had the highest tf value.)
Would you have any suggestions how we might change our approach? (I’ve actually hinted at one
suggestion elsewhere in this assessment. . . ) (Note that there isn’t necessarily one “right answer” to this
question.)
Submission Instructions
Your submission should consist of a report (a PDF file) and implementation (source code) files. Be certain
to include all Java source files (i.e. the “.java” files) needed to run your application.
Submit one compressed file, using only the “zip” format for compression, that includes all files (report
and Java source code) for your submission.
The report (a PDF file) should consist of
Requirements: Summary of the above requirements statement in your own words. Do this for each
part of the assessment.
Analysis and Design: A short (one paragraph) description of your analysis of the problem including
a Class Diagram outlining the class structure for your proposed solution, and pseudocode for
methods. Again, do this for each part of the assessment. Pseudocode need not be a line-by-line
description of what you do in each method, but an overview of how methods operate, what
parameters they take as input, and what they return. Ideally, pseudocode should be detailed
enough to allow me (or someone else) to implement a method in any programming language of
their choosing, but not rely on language-specific constructs/instructions. You can, of course, say
that a method returns a Java HashMap (or something else), but then someone familiar with that
data structure knows the meaning (so you don’t need to explain what a HashMap is).
For Part I you must submit your Java file(s). In my solution I actually only have one class file that
includes several methods in it. You may arrange your solution differently than mine (possibly
using more than one class file), and that is fine.
For Part II you must again submit your Java file(s).
Testing: A set of proposed test cases presented in tabular format including expected output, and
evidence of the results of testing (the simplest way of doing this is to cut and paste the result of
running your test cases into your report).
Note that your programs will be tested against other inputs than the ones provided in my examples,
but the test cases will satisfy the specifications outlined earlier (e.g. for Part II, the files
given to the program will exist, so you don’t need to check for existence/non-existence of the
files).
The implementation should consist of
Your Java source files, i.e. the relevant .java files, not the class (.class) files.
Upload your files (as a single compressed zip file) to
Submission Deadline
Friday, 13 April 2018, 5:00pm (Friday of Week 8)
Notes/Penalties
Because submission is handled electronically, ANY FILE submitted past the deadline will be consid-
ered at least ONE DAY late (penalty days include weekends since assessment is performed electronically).
Late submissions are subject to the University Policy (see Section 6 of the Code of Practice on
Assessment).
Please make sure your Java classes successfully compile and run on ANY departmental computer
system. For this reason, the use of IDEs like NetBeans and Eclipse is discouraged. If your Java source
files do not successfully compile/run, you could suffer a penalty of 5 marks from your grade. This
penalty could be applied for Part I and/or Part II of the assessment, but only once per part. (However,
this will not turn a passing grade into a failing grade.)
If your report is not a PDF file, you will lose 5 marks from your final grade. (However, this will not
turn a passing grade into a failing grade.)
If you use any form of compression other than “zip”, you risk your files not being read by the assessors,
resulting in a mark of 0! If your files can be read, you will still lose 5 marks from your final grade.
(As before, this will not turn a passing grade into a failing grade.)
Note this is an individual piece of work. Please note the University Guidelines on Academic Integrity
(see Appendix L of the Code of Practice on Assessment). You should expect that plagiarism detection
software will be run on your submission to compare your work to that of other students. (Obviously
in this case, the ThreeDice.java class file will be ignored as it has been supplied to you.)
Mark Scheme
Marks for each part of the assessment will be awarded for:
Analysis and Design 10% (This includes things like UML diagrams for your classes/application, jus-
tification for how/why you structure your classes, appropriate pseudocode for methods, etc.)
Implementation 25% (This includes correctness of your programs, appropriate level of comments,
correct indentation so your code is readable, having useful identifiers, etc.)
Testing 10% (Have you done suitable testing in terms of checking different paths through your code,
checking what you do with “unexpected” inputs, a sufficient number of tests, etc.?)
Extra questions in Part I or Part II 5%
Please see the module web page for the feedback form.