CS 540 Fall 2018
CS 540: Introduction to Arti cial Intelligence
Homework Assignment # 6
Assigned: 10/16
Due: 10/30 before class
Question 1: An n-gram Chatbot [100 points]
In this question you will implement a chatbot by generating random sentences from the WARC (Wisconsin
AI100 Rejoinder Corpus) corpus using n-gram language models.
We have created a vocabulary le vocabulary.txt for you to interpret the data, though you do not need it
for programming. The vocabulary is created by tokenizing the corpus, converting everything to lower case,
and keeping word types that appears three times or more. There are 4699 lines in the vocabulary.txt le.
Download the corpus WARC201709 wid.txt from the homework website (note: this is a processed word
ID le, not the original text corpus). This le has one word token per line, and we have already converted
the word to its index (line number) in vocabulary.txt. Thus you will see word indices from 1 to 4699. In
addition, we have a special word type OOV (out of vocabulary) which has index 0. All word tokens that are
not in the vocabulary map to OOV. For example, the rst OOV in the corpus appears as
392 are
1512 entirely
0 undermined
12 .
The words on the right are provided here for readability, they are not in the corpus. The word \undermined"
is not in the vocabulary, therefore it is mapped to OOV and have an index 0. OOV represents a set of out-
of-vocabulary words such as \undermined, extra-developed, metro, customizable, optimizable" etc. But for
this homework you can treat OOV as a single special word type. Therefore, the vocabulary has v = 4700
word types. The corpus has 228548 tokens.
Write a program Chatbot.java with the following command line format, where the commandline input
has variable length1 and the numbers are integers:
$java Chatbot FLAG number1 [number2 number3 number4]
(Part a, 10 points) Denote the vocabulary by the v word types w0;w1;:::;wv 1 in the order of their index (so w0 has
index 0 and represents OOV, and so on). For this homework it is important that you keep this order
so that we can automatically grade your code.
You will rst create a unigram language model. This is a probability distribution over the vocabulary,
for word type w2f0;:::;v 1g the estimated probability is
pi p(w = i) = c(w = i)n ;
where c(w = i) is the count of word type i in the corpus (i.e. how many times wi appeared). Note you
need to estimate and store pi for all v word types, including OOV: p(w = OOV ) is the fraction of 0’s
in the corpus. Your p(w) should sum to 1 over the vocabulary, including OOV.
1We provide code skeleton that already handles variable length input.
CS 540 Fall 2018
When FLAG=100, number1 speci es the word type index i for wi. You should print out two numbers
on two lines: c(w = i) and p(w = i). When printing the probabilities for this homework, keep 7 digits
after the decimal point and perform. rounding. For example,
(Part b, 10 points) Now you implement random sampling from a probability distribution. That is, you will generate a
random word type according to its unigram probability. Here is how you do it:
1. Given the multinomial distribution parameter (p0;p1;:::;pv 1), you split the interval [0;1] into
v segments. Segment 0 is [l0 = 0;r0 = p0]. Note it is closed on the left. Segment i (for
Note these segments are open on the left but closed on the right. Also recall that we want you to
order these segments by their word type index.
2. You generate a random number r uniformly in [0;1].
3. You check which segment r falls into, and output the index of that segment.
A word on sparse storage: this is not an issue for unigrams, but soon you will have many pi’s being zero
in bigrams and trigrams. When you implement the segments above, you should not create segments
for any pi = 0. Mathematically, such segments are empty and will never get selected. Storage-wise,
you do not want to waste space on them. Instead, the suggested data structure is to record the the
triple (i;li;ri) in increasing order of i but only for nonzero pi’s. This is known as sparse storage format.
Again, be sure to arrange the segments in increasing order of word type index.
CS 540 Fall 2018
However, in order to test your code in a reproducible way, we will specify the random number r from
commandline. Speci cally, When FLAG=200, we provide number1 and number2 (we guarantee that
number2 number1), and you should let r = number1=number2 (remember to use Java ‘double’ here
so you don’t get an integer zero!) instead of a random r. Your code should output three numbers on
three lines: the word type index i that this r selects, li the left end of wi’s interval, and ri the right
end of wi’s interval.
$java Chatbot 200 326 10000
0
0.0000000
0.0326715
$java Chatbot 200 327 10000
1
0.0326715
0.0328290
$java Chatbot 200 329 10000
2
0.0328290
0.0344873
$java Chatbot 200 5000 10000
2364
0.4998906
0.5147278
$java Chatbot 200 99997 100000
4699
0.9999650
1.0000000
(Part c, 20 points) Now you will create a bigram language model of the form. p(wjh), where both w and h (the history)
are word types in the vocabulary. Fixing h, p(w j h) is a multinomial distribution over word types
w = 0;:::;v 1, and is estimated as follows:
p(wjh) = c(h;w)Pv 1
u=0 c(h;u)
;
where c(h;w) is the count of the bigram (adjacent word pair) h;w in the corpus. These counts are
obtained by letting the history start at the rst word position in the corpus, then gradually moving
the history one position later, until nally the (history, word) pair \use up" the corpus. For bigrams,
that means history stops at the 2nd to last word position in the corpus. For example, if the corpus
is \cake cake cake you want cake cake" then c(cake;you) = 1, c(cake;cake) = 3, c(cake;want) = 0.
Note it is perfectly ne to estimate p(w = ijh = i) for the same word type i. It is also perfectly ne
if either w or h or both are OOV.
CS 540 Fall 2018
It is possible that c(h;w) = 0 for some history, word combinations. As long as the history appears in
the corpus (which is the case for our bigrams), we naturally have the estimate p(wjh) = 0.
The above discussion is for a xed h, where p(wjh) is a multinomial distribution. You will need to
do so for all possible h = 0;:::;v 1, so that you will end up with v multinomial distributions. This
is where the sparse storage becomes important.
When FLAG=300, number1 speci es the history word type index h, and number2 speci es the word
type index w. You should print out three numbers on three lines: c(h;w), Pv 1u=0 c(h;u), and p(wjh).
For example,
$java Chatbot 300 414 2297
1054
1082
0.9741220
$java Chatbot 300 0 0
406
7467
0.0543726
$java Chatbot 300 0 1
0
7467
0.0000000
$java Chatbot 300 2110 4240
115
917
0.1254089
$java Chatbot 300 4247 0
41
1435
0.0285714
(Part d, 10 points) Now you will use the same function in Part b to sample from a bigram given h. That is, instead of
using the unigram probability p(w), we x some h and you will generate a random word type from
p(w j h). The method is the same, you just need to do more bookkeeping and record the segments
separately for each history h. Speci cally, for history h the segments are:
Again, you should use sparse storage.
CS 540 Fall 2018
When FLAG=400, we provide number1 and number2 (we guarantee that number2 number1), num-
ber3 is the word type for history h, and you should let r = number1=number2 to pick the corresponding
word type w from p(w j h). Your code should output three numbers on three lines: the word type
index i that this r selects, lhi the left end of wi’s interval conditioned on h, and rhi the right end of
wi’s interval conditioned on h.
$java Chatbot 400 0 100 414
0
0.0000000
0.0009242
$java Chatbot 400 1 100 414
2297
0.0055453
0.9796673
$java Chatbot 400 98 100 414
2298
0.9796673
0.9861368
$java Chatbot 400 81 100 4697
4533
0.8000000
1.0000000
$java Chatbot 400 15 100 4442
4007
0.1463415
1.0000000
(Part e, 20 points) Finally you create a trigram language model of the form. p(w jh1;h2), where now the history is the
pair of word types h1;h2 in that order. Fixing h1;h2, p(wjh1;h2) is a multinomial distribution over
word types w = 0;:::;v 1, and is estimated as follows:
p(wjh1;h2) = c(h1;h2;w)Pv 1
u=0 c(h1;h2;u)
;
where c(h1;h2;w) is the count of the trigram (adjacent word triple) h1;h2;w in the corpus. For the cake
corpus c(cake;cake;you) = 1, c(cake;cake;cake) = 1, c(cake;cake;want) = 0, c(cake;you;want) = 1
and for u 6= want we have c(cake;you;u) = 0, c(want;cake;cake) = 1 and for u 6= cake we have
c(want;cake;u) = 0.
Now we have a new problem: the history h1;h2 may not appear in the corpus at all! For example,
h1 = you;h2 = cake never appeared. If so, you simply declare that p(wjh1;h2) is unde ned for any
CS 540 Fall 2018
w. We will handle the situation later. In fact, only a small fraction of possible trigram histories appear
in the WARC corpus. You should only store trigram probabilities for these histories. This is also part
of the sparse storage strategy.
When FLAG=500, number1 speci es the history word type index h1, number2 is h2, and number3 is
w. You should print out three numbers on three lines: c(h1;h2;w), Pv 1u=0 c(h1;h2;u), and p(wjh1;h2).
In the case that p(wjh1;h2) is unde ned, the third line should be the text unde ned. For example,
(Part f, 10 points) Now you will sample from the trigram model p(w j h1;h2). The method is the same, though you
will decline to generate a word if the trigram is unde ned. When FLAG=600, we provide number1
and number2 (we guarantee that number2 number1), number3 is h1 and number4 is h2, and you
should let r = number1=number2 to pick the corresponding word type w from p(w jh1;h2). When
this conditional probability is de ned, your code should output three numbers on three lines: the word
type index i that this r selects, lh1;h2;i the left end of wi’s interval conditioned on h1;h2, and rh1;h2;i
the right end of wi’s interval conditioned on the history. Otherwise, your code should output a single
line with text unde ned.
$java Chatbot 600 2 5 660 3425
2178
0.3636364
0.4545455
CS 540 Fall 2018
$java Chatbot 600 2 5 3001 104
3083
0.3529412
0.4117647
java Chatbot 600 50 100 496 4517
540
0.4545455
0.5000000
$java Chatbot 600 33 100 2591 2473
286
0.3086420
0.4444444
$java Chatbot 600 0 100 2297 414
undefined
$java Chatbot 600 0 100 496 4517
5
0.0000000
0.0795455
(Part g, 20 points) Now the fun begins! You will generate random sentences using your n-gram language models. But for
building a chatbot, we will specify a sentence pre x s1;s2;:::;st which are t initial words (represented
by word type indices) in the sentence. Your code will complete this sentence as follows:
1. set seed for randomizer
2. Repeat:
(a) h1 = st 1;h2 = st
(b) generate a random word st+1 from ~p(wjh1;h2)
(c) t = t + 1 // shifts the trigram history by one position in the next iteration.
3. Until the generated word is a period, or a question mark, or an exclamation mark.
Note in step 1(b) a complication arises from the sentence pre x, and we introduced a placeholder ~p:
{ The sentence pre x is empty. In this case, simply let ~p(w jh1;h2) be the unigram model p(w)
which does not require any history.
{ The sentence pre x has only one word s1. In this case, let ~p(w j h1;h2) = p(w j h = s1) the
bigram model.
{ The sentence pre x h1 = st 1;h2 = st as history is unde ned for a trigram model. If so, let
~p(wjh1;h2) = p(wjh2) the bigram model.
CS 540 Fall 2018
{ Otherwise, let ~p(wjh1;h2) = p(wjh1;h2) the trigram model.
When FLAG=700, number1=seed, which is the seed of the random number generator; number2=t
(which only needs to be 0, 1, or 2), and the next t numbers on the commandline specify the sentence
pre x s1;s2;:::;st. We will guarantee that si is not period, or a question mark, or an exclamation
mark.
If seed = 1, you actually do not set the seed (this allows you to generate di erent random sentences).
Otherwise you should set the seed to seed. To set the seed in Java, use the following code:
Random rng = new Random();
if (seed != -1) rng.setSeed(seed);
In step 1(b) each time you should generate a new random number r2 [0;1] in order to generate the
random word. This should be done with
rng.nextDouble();
You should try your code multiple times with the same sentence pre x: when seed = 1 your code
should complete the sentence in di erent ways; otherwise it should be the same completion.
Your code will output the completed sentence (starting at st+1), one word index per line.That last sentence is a little jarring to this professor because it is:
barely loopholes because they believe would hinder gaining public trust will be created by ai
, the professor is a black box to be implemented into di erent languages with a considerably
smaller amount of data .
CS 540 Fall 2018
(Part h, no points) For this part you do not need to develop your code any further, but you will test out the Chatbot that
you have developed by actually \talking" to it.
Please download ChatbotDriver.java and place this le together with your Chatbot.java in the
same directory. This driver class basically takes the user input, apply simple rules to generate a pre x
based on the input (or just use the input itself as the pre x), and call \java Chatbot 700 -1 prefix"
to generate a response and visualize it as actual words (\OOV" will be displayed for OOV indices).
You can compile both les and try chatting with your Chatbot. Also make sure you have the txt les
in the directory as well.
Note: because of randomness, your results will di er.
$javac Chatbot.java ChatbotDriver.java
$java ChatbotDriver
You: What’s your opinion on self-driving cars?
Chatbot: self-driving cars right now , tesla still received plenty of OOV and
needed some more focus towards intellectual fields .
You: healthcare system
Chatbot: healthcare system can be useful .
You: tell me a joke
Chatbot: to assume that those personal information exposure or government
website being hacked can bring down an interstate .
You: say anything
Chatbot: politicians into supporting their control over your life would be
an ideal way of monitoring the physical hardware is not yet implemented
technology in terms of employment market .
You: what is your idea of an ideal world?
Chatbot: my idea of an ideal world we live our everyday lives .
Have fun and try to make it more intelligent by modifying ChatbotDriver.java! (e.g. adding more
rules to the generateCommand method.)