Sample output
Here’s some sample output from my version of the program.
$ java BreakCaesar "htcs aplntgh, vjch, pcs bdctn"
==============================
htcs aplntgh, vjch, pcs bdctn
==============================
send lawyers, guns, and money
$ java BreakCaesar "aqw’nn pgxgt wpfgtuvcpf vjku"
==============================
aqw’nn pgxgt wpfgtuvcpf vjku
==============================
you’ll never understand this
$ java BreakCaesar "Wsccsccszzs novdk lveoc"
==============================
Wsccsccszzs novdk lveoc
==============================
Mississippi delta blues
$ java BreakCaesar "Vlcha siol mqilx. Qy unnuwe un xuqh."
==============================
Vlcha siol mqilx. Qy unnuwe un xuqh.
==============================
Bring your sword. We attack at dawn.
$ java BreakCaesar "Hvs eiwqy pfckb tcl xiadsr cjsf hvs zonm rcug."
==============================
Hvs eiwqy pfckb tcl xiadsr cjsf hvs zonm rcug.
==============================
The quick brown fox jumped over the lazy dogs.
$ java BreakCaesar
Oops, you haven’t given enough parameters!
Usage: java BreakCaesar "string"
Questions to consider
You don’t have to write any code; just comment on the questions I ask here.
What would we do differently if we knew the language we’re examining isn’t English but some other
language (e.g. suppose we know the people communicating via this Caesar cipher usually write/speak in
Polish)?
Suppose we (somehow) know that the person doing the encryption uses one shift value for lowercase
letters, and a different shift value for uppercase letters. What would we have to do differently? How would
that affect our calculations, or how would we have to alter our program/calculations to account for this?
Part II
Introduction
Continuing along the lines we started in the course notes, let’s consider a little more in the area of text
analysis or “natural language processing”.
Given a set of documents, we want to find the “important” words that appear in this set of documents.
Firstly, we note that the meaning of the word “document” depends upon the collection we are considering.
For example, if we are looking at a set of tweets from Twitter, then it’s probably sensible to consider
each tweet as its own separate entity, so in this case the word “document” means one tweet.
If we are looking at a set of books, then one “document” could mean one book. Similarly, if we consider
a set of (academic) journal papers, one “document” might likely refer to an individual paper.
On the other hand, if we are examining a single book, it might make more sense to consider each chapter
as a “document”. Or perhaps we go down to the level where a “document” is a single paragraph of the book.
And so forth. . .
For the purposes of this exercise, we will consider a set of text files, and each “document” will be one
of these text files, but in the discussion that follows, keep the idea in mind that the word “document” can
depend upon the particular collection we consider.
Also keep in mind in what follows that we are always going to want to ignore the case of letters, so we
will want to begin our analysis by converting everything to lower case (say).
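Purely as an illustrative sketch (the class name, method name, and splitting rule here are my own choices, not part of the assessment), normalizing text to lower case and splitting it into words might look like this in Java:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Normalize {
    // Lower-case the text and split it on runs of non-letter characters.
    // Splitting on "anything that is not a-z" is an assumption; you may
    // want to treat apostrophes or digits differently.
    public static List<String> words(String text) {
        String lower = text.toLowerCase();
        String[] parts = lower.split("[^a-z]+");
        // split() can produce a leading empty string if the text starts
        // with a non-letter character, so filter empty strings out.
        return Arrays.stream(parts)
                     .filter(w -> !w.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("The quick, BROWN fox!"));
        // prints [the, quick, brown, fox]
    }
}
```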
Term frequency-inverse document frequency
In order to define what an “important” word is in our set of documents, we will use a measure that is referred
to as “term frequency-inverse document frequency”. This is the product of two numbers, one being the “term
frequency” and the other being “inverse document frequency”.
To continue our definitions, suppose that our collection of documents is numbered, so that we have
documents d1, d2, ..., dn. Therefore, there are n documents in our collection.
“Term frequency” is defined for a term (or word) and a particular document. Suppose that t is a term in
a document d. The term frequency, tf(t, d), of t in d is defined as

tf(t, d) = (number of occurrences of t in d) / (total number of all words in d).
We can calculate these term frequencies for each term in each document. (We have essentially talked
about how to do this in class. In that example we were calculating the numerators of tf(t, d). We can then
calculate the term frequencies by summing those numerators to get the total word count and dividing each
numerator by that sum. Keep in mind these will give you decimal numbers, which is what we want.)
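As a sketch of that calculation (the class and method names here are my own invention, not required by the assessment), converting a document’s word list into tf values with a HashMap might look like:

```java
import java.util.HashMap;
import java.util.List;

public class TermFreq {
    // Given the words of one document, return a map from each term to
    // its term frequency: tf(t, d) = count(t in d) / total words in d.
    public static HashMap<String, Double> tf(List<String> words) {
        // First pass: raw counts (the numerators of tf).
        HashMap<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.put(w, counts.getOrDefault(w, 0) + 1);
        }
        // Second pass: divide each count by the total word count.
        // Dividing by a double gives the decimal values we want.
        HashMap<String, Double> tf = new HashMap<>();
        double total = words.size();
        for (String w : counts.keySet()) {
            tf.put(w, counts.get(w) / total);
        }
        return tf;
    }
}
```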
Term frequency is obviously a measure of how often a term appears in a particular document. If a term
t does not appear in the document di, then (by definition) we have tf(t, di) = 0.
Side note: There are other definitions of “term frequency” that are sometimes used. These include
the raw frequencies (the numerators of tf(t, d)), boolean frequencies (0/1 values indicating whether or not
a word appears in the document), etc.
“Inverse document frequency” is defined for a term (word) over the whole set of documents. Inverse
document frequency is a measure of how much “information” the term provides, i.e. how common or rare
the term is across all documents. One of the most common definitions of inverse document frequency is this
one (where t is a term, and D refers to the collection of all documents d1, ..., dn):

idf(t, D) = log( n / (number of documents containing t) )
As defined here, idf(t, D) only makes sense if the term t appears in at least one of the documents;
otherwise you are dividing by 0. To account for this, idf(t, D) is sometimes defined with 1 added to the
denominator to avoid a “division by zero” error. We will use the definition given above, and assume that
we only consider terms that appear in at least one of the documents in our collection.
For a given term t, idf(t, D) is largest if t appears in only one of the documents. On the other hand,
if t appears in all n documents, then idf(t, D) is zero. So common words like “the”, “and”, “or”
will often have idf values of 0, or close to 0, assuming we are looking at a large collection of documents.
These words give little information about the meaning of a document, and therefore we want to discount them.
This is what idf is supposed to do for us.
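A sketch of the idf calculation, assuming each document is represented by the set of terms it contains (both the representation and the names here are my own choices):

```java
import java.util.List;
import java.util.Set;

public class InverseDocFreq {
    // idf(t, D) = log( n / number of documents containing t ),
    // where each document is represented by the set of terms in it.
    // As in the text, this assumes t appears in at least one document
    // (otherwise we would divide by zero).
    public static double idf(String term, List<Set<String>> docs) {
        int containing = 0;
        for (Set<String> doc : docs) {
            if (doc.contains(term)) containing++;
        }
        return Math.log((double) docs.size() / containing);
    }
}
```

Note that a term appearing in every document gives log(n/n) = log(1) = 0, matching the discussion above.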
The “term frequency-inverse document frequency” of a term t is defined on a per-document basis for a
document di, over the whole collection of documents D:

tfidf(t, di, D) = tf(t, di) × idf(t, D)
A high value of tfidf is achieved by a high term frequency (in the given document) and a low frequency
of that term in the whole collection of documents. Therefore, these tfidf values tend to filter out common
terms (i.e. the tfidf values are low), and the terms with high values are often the terms of interest in further
analysis of the text.
These tfidf numbers are what we are interested in computing.
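Combining the two pieces is then just one multiplication per term. A possible sketch (the names are my own), taking a tf map for one document and an idf map over the collection:

```java
import java.util.HashMap;

public class TfIdf {
    // Given tf values for one document and idf values over the whole
    // collection, compute tfidf(t, di, D) = tf(t, di) * idf(t, D)
    // for every term in that document. This assumes the idf map has
    // an entry for every term appearing in the tf map.
    public static HashMap<String, Double> combine(HashMap<String, Double> tf,
                                                  HashMap<String, Double> idf) {
        HashMap<String, Double> result = new HashMap<>();
        for (String term : tf.keySet()) {
            result.put(term, tf.get(term) * idf.get(term));
        }
        return result;
    }
}
```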
Note: If we have only a single document in our collection D, it doesn’t make sense to adjust the tf values
by multiplying by the idf value of that term. This is because for each term in the (single) document, the idf
value will always be 0 (since log(1/1) = 0). Therefore, in the special case that we are considering a single
document, we revert to using tf values. In that case, however, we need to proceed carefully (and perhaps
differently). It’s likely that the high tf values will correspond to common (non-interesting) words like “the”,
“or”, etc.
A weblink
You can find a small example of computing tfidf values on the Wikipedia page for tf-idf. There is also a
similar explanation to what I have given here for the definition of tfidf on that webpage.
It should be clear that you can find many other resources online that will talk more about the concept of
tfidf values (weights). Keep in mind that these other resources might be using different definitions for “term
frequency” or “inverse document frequency”. They should be similar, but could result in different values in
the calculations.
Requirements
As mentioned, we want to compute tfidf values for a collection of documents. Or, more specifically, you
need to write a Java program that will compute tfidf values.
We note the following for this assessment (what follows are requirements for the assessment, as well as
some hints/suggestions):
1. Requirement: Call your application program “TFIDF.java”.
2. Requirement: Filenames of the text files will be supplied to the program via command line arguments.
In other words, this TFIDF program will be invoked with a call like
java TFIDF example1.txt example2.txt
This means that the filenames will be in the args array, i.e. the filenames are the values args[0],
args[1], etc.
3. For the purposes of this exercise, you can assume that the text files exist (you don’t have to check for
existence/non-existence of files).
4. You can assume that at least one filename is given to the program. (But it’s possible that only one
filename will be given. So what do you do in that case? See the “Note” above, right before the
“Requirements” section began.)
5. Things will likely be much easier if you use HashMap variables in this part of the assessment (or some
similar data structure). We used a HashMap in the example in the course notes when counting word
frequencies in a single document. Recall that we were looking at the number of occurrences of a word
in a document, so those are integer values. We want to compute something different (but related) here.
6. Java does not allow you to make an array of HashMap objects. (Ok, you might be able to do this, but it’s
kind of difficult to do, and not recommended.) You can, however, make an ArrayList of HashMap
objects. Here’s how you can declare such a thing and (start to) initialize and use it. (Obviously this is
only partial Java code, not a full-on application.)
// Declare an ArrayList of HashMap objects. Note that each
// HashMap uses the same (key, value) types.
ArrayList<HashMap<String, Double>> list = new ArrayList<HashMap<String, Double>>();
// Declare an individual HashMap.
HashMap<String, Double> stuff = new HashMap<String, Double>();
// Add (key, value) pairs to the HashMap.
stuff.put("thing", 12.5);
stuff.put("more things", 20.0);
// Add the HashMap to the ArrayList.
list.add(stuff);
// Retrieve the first HashMap in the ArrayList, and the value corresponding
// to the key "thing". (This assumes that key exists in the HashMap.)
list.get(0).get("thing");
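Putting the pieces together, here is one possible skeleton for the overall program (the class name TFIDFSkeleton and the helper termFrequencies are my own placeholders; your actual program must be called TFIDF.java as required above, and the splitting rule here is an assumption you may want to adjust):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;

public class TFIDFSkeleton {
    // Compute tf values for one document's text: lower-case everything,
    // split on non-letter characters, count words, divide by the total.
    static HashMap<String, Double> termFrequencies(String text) {
        String[] words = text.toLowerCase().split("[^a-z]+");
        HashMap<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String w : words) {
            if (w.isEmpty()) continue;     // skip artifacts of split()
            counts.put(w, counts.getOrDefault(w, 0) + 1);
            total++;
        }
        HashMap<String, Double> tf = new HashMap<>();
        for (String w : counts.keySet()) {
            tf.put(w, counts.get(w) / (double) total);
        }
        return tf;
    }

    public static void main(String[] args) throws IOException {
        // One HashMap of tf values per input file, in the same order
        // as the filenames given in args.
        ArrayList<HashMap<String, Double>> tfPerDoc = new ArrayList<>();
        for (String filename : args) {
            String text = new String(Files.readAllBytes(Paths.get(filename)));
            tfPerDoc.add(termFrequencies(text));
        }
        // ... next steps: idf values over tfPerDoc, then tfidf per
        // document (remembering the single-document special case) ...
        System.out.println("Read " + tfPerDoc.size() + " document(s).");
    }
}
```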