
Homework 4, Part 2: Spam Filter
Machine Learning Concepts & Applications, APANPS4335
Columbia University, SPS
Due Tuesday, March 20 at 11:59 PM
Total: 125 Points
Directions: please submit your homework as documented Rmd
and html files on Canvas.
Questions 1 to 5 have a written component and an R component. Your write-up should
be clear, concise and include relevant output from R, such as plots and tables. Your grade
on these questions will be determined by the completeness, correctness and clarity of your
programming and writeup.
Naive Bayes Spam Filtering Walkthrough. Working with data can often be messy
and frustrating, especially when you are learning a language. The following is designed to
be an introduction to working with real-world data. You will implement a Naive Bayes
spam filter for the lingspam dataset. This dataset contains legitimate emails and spam
sent to a linguistics email list.
Step 1: Download the dataset Homework4Data from the course website. Place it in your
working directory for R. The data are partitioned into ham and spam, and further partitioned
into training and testing, with 350 examples in each training set and 150 in each testing set,
for a total of 1000. These are a collection of emails to a linguistics server list. They have been
preprocessed for you by 1) removing non-letter characters, 2) removing stopwords, and 3)
stemming. Stopwords are short functional words, such as in, is, the. Stemming involves
trimming inflected words to their stem, such as reducing running to run. For example, an
original ham email is
Subject: negative concord
i am interested in the grammar of negative concord in various dialects
of american and british english . if anyone out there speaks natively
a dialect that uses negative concord and is willing to answer grammaticality
questions about their dialect , would they please send me an email note
to that effect and i ’ ll get back to them with my queries . my address
is : kroch @ change . ling . upenn . edu thanks .
The cleaned version is
negative concord interest grammar negative concord various dialect american
british english anyone speak natively dialect negative concord answer
grammaticality question dialect please send email note effect ll back
query address kroch change ling upenn edu thank
Likewise, an original spam message is
Subject: the best , just got better
the 2 newest and hottest interactive adult web sites will soon be the
largest ! ! ! check out both of the following adult web sites free
samples and trial memberships ! ! ! ! live triple x adult entertainment
and pictures . new content added weekly ! ! ! girls , boys , couples
, and more ! ! ! check them both out : http : / / www2 . dirtyonline
. com http : / / www . chixchat . com
The cleaned version is
best better newest hottest interactive adult web site soon largest check
both follow adult web site free sample trial membership live triple
x adult entertainment picture content add weekly girl boy couple check
both http www dirtyonline com http www chixchat com
The original email database can be found at:
https://www.kaggle.com/omkarpathak27/identify-spam-using-emails/
We have provided the cleaned versions for you. However, R packages such as tm can
clean datasets, or this can easily be done through custom code.
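As an illustration only (the cleaned files are already provided), a preprocessing pipeline along these lines could be built with tm; the raw.text string below is a made-up example, and the SnowballC package is needed for stemming:
#==============================================================
# A minimal sketch of cleaning raw text with the tm package
library(tm)
raw.text <- "i am interested in the grammar of negative concord"
corpus <- VCorpus(VectorSource(raw.text))
corpus <- tm_map(corpus, content_transformer(tolower)) # lowercase
corpus <- tm_map(corpus, removePunctuation)            # drop non-letter characters
corpus <- tm_map(corpus, removeNumbers)                # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("en")) # remove stopwords
corpus <- tm_map(corpus, stemDocument)                 # stem inflected words
as.character(corpus[[1]])                              # inspect the cleaned text
#==============================================================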
Step 2: Upload the data into R. Open a new document in R. Save it as NaiveBayesFiles.R.
We are going to make a function to read the emails from a directory into your workspace.
Copy the following code into NaiveBayesFiles.R:
#==============================================================
# To read in data from the directories:
# Partially based on code from C. Shalizi
read.directory <- function(dirname) {
  # Store the emails in a list
  emails = list();
  # Get a list of filenames in the directory
  filenames = dir(dirname,full.names=TRUE);
  for (i in 1:length(filenames)){
    emails[[i]] = scan(filenames[i],what="",quiet=TRUE);
  }
  return(emails)
}
# Example: ham.test <- read.directory("kag-ham-test/")
#==============================================================
The above code creates a reusable function in R, which takes the input argument dirname.
Use functions when you need to run the same bit of code repeatedly. In this case, we will be
reading in data from four different directories. This code creates a list of emails. Lists store
objects with an ordering. In this case, the order is dictated by the position in the directory.
The code is structured as follows:
emails = list() instantiates the variable emails as a list, albeit with no entries currently.
filenames = dir(dirname,full.names=TRUE) creates a vector of strings, populated
by the names of the files in directory dirname. The option full.names=TRUE means
that the path directory is prepended onto the file name to give a path; if this is set to
FALSE, only the file names are given.
for (i in 1:length(filenames)){ loops through all of the file names.
emails[[i]] = scan(filenames[i],what="",quiet=TRUE) fills list element i with
a vector (or list) of input values. Here what="" means that all fields will be read
as strings. The default separator is white space; this can be changed through the
sep argument. The argument quiet=TRUE keeps scan from displaying statistics
about the imported document, which would fill up the command screen when hundreds of
documents are read. The net result of this command is that element i of emails is
filled by a vector of strings, with each string representing a word.
return(emails) returns the object emails when the function read.directory(dirname)
is called.
To be able to use a function, save it after making any changes, and then run the source
file by selecting File → Source File... and then selecting NaiveBayesFiles.R.
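Equivalently, assuming NaiveBayesFiles.R is saved in your working directory, you can source it from the console:
source("NaiveBayesFiles.R")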
Read in all four email directories:
ham.test <- read.directory("kag-ham-test")
spam.test <- read.directory("kag-spam-test")
ham.train <- read.directory("kag-ham-train")
spam.train <- read.directory("kag-spam-train")
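As a quick sanity check (an optional step, not part of the assignment), confirm that each list holds the expected number of emails:
length(ham.train)  # should be 350
length(spam.train) # should be 350
length(ham.test)   # should be 150
length(spam.test)  # should be 150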
Step 3: Make a dictionary. We are going to use all of the files (training and testing, ham
and spam) to make a dictionary, or a list of all the words used. First, copy the following
code into NaiveBayesFiles.R:
#==============================================================
# Make dictionary sorted by number of times a word appears in corpus
# (useful for using commonly appearing words as factors)
# NOTE: Use the *entire* corpus: training, testing, spam and ham
make.sorted.dictionary.df <- function(emails){
  # This returns a data frame that is sorted by the number of times
  # a word appears
  # List of vectors to one big vector
  dictionary.full <- unlist(emails)
  # Tabulate the full dictionary
  tabulate.dic <- tabulate(factor(dictionary.full))
  # Find unique values
  dictionary <- unique(dictionary.full)
  # Sort them alphabetically
  dictionary <- sort(dictionary)
  dictionary.df <- data.frame(word = dictionary, count = tabulate.dic)
  sort.dictionary.df <- dictionary.df[order(dictionary.df$count,decreasing=TRUE),];
  return(sort.dictionary.df)
}
#==============================================================
The function make.sorted.dictionary.df(emails) takes the list emails generated by
read.directory() and creates a data frame that holds 1) the list of unique words, and 2)
the number of times each appears in the corpus, sorted in descending order by frequency.
We want the sorted list so that we can easily access the most used words. The code is
structured as follows:
dictionary.full <- unlist(emails) takes the list emails and converts it into a
very long vector, dictionary.full.
tabulate.dic <- tabulate(factor(dictionary.full)) makes a vector of counts
for each word that appears in dictionary.full. The argument
factor(dictionary.full) tells the tabulate function that it is to bin data according
to the enumerated categories for dictionary.full.
dictionary <- unique(dictionary.full) returns a vector of unique words in
dictionary.full.
dictionary <- sort(dictionary) sorts the vector dictionary from a to z; this is
necessary because the levels of factor(dictionary.full) are the sorted unique words,
so the counts in tabulate.dic are in alphabetical order (see the small example after
this list).
dictionary.df <- data.frame(word = dictionary, count = tabulate.dic) creates
a new data frame, dictionary.df, with two categories, word and count. The
word category is populated by the alphabetized dictionary of factors, the count category
is populated by the counts for the alphabetized words.
sort.dictionary.df <- dictionary.df[order(dictionary.df$count,decreasing=TRUE),]
reorders the rows of dictionary.df to be sorted in descending order according to
the count values. order(dictionary.df$count,decreasing=TRUE) returns the index
ordering that sorts the count values in a descending manner; this is used to
rearrange the rows, but not the columns, of the data frame.
return(sort.dictionary.df) returns the sorted values as a data frame.
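For instance (a toy illustration, not part of the assignment), the counts produced by tabulate(factor(...)) line up with the alphabetically sorted unique words:
words <- c("cat", "dog", "cat", "ant")
tabulate(factor(words)) # 1 2 1: counts for the levels "ant", "cat", "dog"
sort(unique(words))     # "ant" "cat" "dog": the matching word order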
We now create a dictionary for the full dataset by concatenating the individual lists into
a single large one for all emails, and then using the large list to create a dictionary:
all.emails <- c(ham.train,ham.test,spam.train,spam.test)
dictionary <- make.sorted.dictionary.df(all.emails)
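Since the data frame is sorted by count in descending order, the most frequently used words can be inspected directly (the actual output depends on the data):
head(dictionary, n = 10) # the 10 most common words and their counts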
Step 4: Make dictionary counts for all subsets. We are now going to create counts for each
dictionary word in all documents and place them in a document (rows) by word (columns)
matrix. Copy the following script to NaiveBayesFiles.R:
#==============================================================
# Make a document-term matrix, which counts the number of times each
# dictionary element is used in a document
make.document.term.matrix <- function(emails,dictionary){
  # This takes the email and dictionary objects from above and outputs a
  # document-term matrix; dictionary is a vector of words (e.g. dictionary$word)
  num.emails <- length(emails);
  num.words <- length(dictionary);
  # Instantiate a matrix where rows are documents and columns are words
  dtm <- mat.or.vec(num.emails,num.words); # A matrix filled with zeros
  for (i in 1:num.emails){
    num.words.email <- length(emails[[i]]);
    email.temp <- emails[[i]];
    for (j in 1:num.words.email){
      ind <- which(dictionary == email.temp[j]);
      dtm[i,ind] <- dtm[i,ind] + 1;
    }
  }
  return(dtm);
}
# Example: dtm <- make.document.term.matrix(emails.train,dictionary$word)
#==============================================================
In many cases, producing a full document-term matrix is a poor idea. It requires a
large amount of storage space for relatively little information. In settings with large corpora
or large dictionaries, it is often best to store the data in the form of tokens that simply
denote document number, word number in the dictionary, and count for that word-document
combination. Only counts of at least 1 are stored, yielding something like this:
document word count
3        1    2
3        5    1
3        11   2
3        12   2
...      ...  ...
However, the matrix representation greatly simplifies the coding required for implementing
a naive Bayes classifier. The function is structured as follows:
num.emails <- length(emails) calculates the number of emails by computing the
length of the list.
num.words <- length(dictionary) calculates the length of the dictionary.
dtm <- mat.or.vec(num.emails,num.words) instantiates a matrix with num.emails
rows and num.words columns that is filled with zeros. Note that mat.or.vec(num.emails,1)
would instead instantiate a vector with num.emails entries, filled with zeros.
for (i in 1:num.emails){ loops through each of the emails.
num.words.email <- length(emails[[i]]) calculates the number of words in email
i.
email.temp <- emails[[i]] temporarily stores the vector for email i in email.temp.
for (j in 1:num.words.email){ loops through each of the words in email i.
ind <- which(dictionary == email.temp[j]) stores the dictionary index equal to
the jth word of email i.
dtm[i,ind] <- dtm[i,ind] + 1 adds a count of 1 to the observation matrix for
word ind in email i.
return(dtm) returns the document-term matrix.
We now create a document term matrix for each group of emails. Since we are looping
over a large space in R, this may take a few minutes to run:
dtm.ham.train <- make.document.term.matrix(ham.train,dictionary$word)
dtm.ham.test <- make.document.term.matrix(ham.test,dictionary$word)
dtm.spam.train <- make.document.term.matrix(spam.train,dictionary$word)
dtm.spam.test <- make.document.term.matrix(spam.test,dictionary$word)
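As an aside, here is a minimal sketch (not required for the assignment; dtm.to.tokens is a hypothetical helper) of converting a dense document-term matrix into the token form described earlier, using base R matrix indexing:
#==============================================================
# Convert a dense document-term matrix to (document, word, count)
# triplets, keeping only counts of at least 1
dtm.to.tokens <- function(dtm){
  nz <- which(dtm > 0, arr.ind = TRUE) # row/column indices of nonzero entries
  tokens <- data.frame(document = nz[, "row"],
                       word = nz[, "col"],
                       count = dtm[nz])
  # Order by document, then by word, to match the table shown earlier
  tokens[order(tokens$document, tokens$word), ]
}
# Example: tokens <- dtm.to.tokens(dtm.ham.train)
#==============================================================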
Step 5: Create a naive Bayes classifier. All of your data is now in matrix form. We are
now going to use it to compute naive Bayes probabilities. Computationally, the normalizer
p(x^test) can be ignored in the posterior probabilities for both classes, since we only need
to determine which class has a higher likelihood. A test document is declared spam if
\[ \hat{p}(y = 1) \prod_{j} \hat{p}(x_j^{\text{test}} \mid y = 1) \;\geq\; \hat{p}(y = 0) \prod_{j} \hat{p}(x_j^{\text{test}} \mid y = 0), \]
where the product runs over the words of the test document, and ham otherwise. Therefore,
we only need to estimate the prior probabilities p(y = 1) and p(y = 0) and the word
probabilities for each class. We set p̂(y = 0) = p̂(y = 1) = 0.5 because the number of spam
documents is equal to the number of ham documents in both the training and test sets.
When computing probabilities defined by products, like those above, it is more convenient
to work in log probabilities. Consider a dictionary with two words, each
with equal probability given that the document is spam, and a test document with 100
words. Then,
\[ p(y = 1 \mid x^{\text{test}}) = \frac{1}{2} \prod_{j=1}^{100} \frac{1}{2} = \left(\frac{1}{2}\right)^{101} \approx 3.9 \times 10^{-31}. \]
While this is a best case for a small document, it is way beyond machine precision. If we
compute the log probability instead,
\[ \log p(y = 1 \mid x^{\text{test}}) = \log\frac{1}{2} + \sum_{j=1}^{100} \log\frac{1}{2} = -101 \log 2 \approx -70. \]
This can easily be stored within machine precision. Even when computing a vector of
probabilities, it is often a good idea to work in log probabilities. Note that the log function
in programs like R or Matlab uses base e.
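You can see the underflow problem directly in R (a quick illustration with made-up numbers): a long product of probabilities collapses to zero, while the corresponding sum of logs is stored without trouble.
prod(rep(0.5, 1200))     # 0: the product of 1200 halves underflows to zero
sum(log(rep(0.5, 1200))) # -831.7766: easily representable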
To compute a vector of log probabilities from a document term matrix with phantom
examples, copy the following code into NaiveBayesFiles.R:
#==============================================================
make.log.pvec <- function(dtm,mu){
  # Sum up the number of instances per word
  pvec.no.mu <- colSums(dtm)
  # Sum up number of words
  n.words <- sum(pvec.no.mu)
  # Get dictionary size
  dic.len <- length(pvec.no.mu)
  # Incorporate mu and normalize
  log.pvec <- log(pvec.no.mu + mu) - log(mu*dic.len + n.words)
  return(log.pvec)
}
#==============================================================
The following commands were introduced in this code:
sum(pvec.no.mu) sums over all values in the vector pvec.no.mu.
colSums(dtm) sums each column of the matrix dtm, producing a vector with the
number of elements equal to the number of columns in dtm.
pvec.no.mu + mu adds the scalar value mu to all elements of the vector pvec.no.mu.
(pvec.no.mu + mu)/(mu*dic.len + n.words) divides all elements of the vector pvec.no.mu
+ mu by the scalar value mu*dic.len + n.words; the code computes the log of this ratio
as a difference of two logs.
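In equation form, make.log.pvec computes the smoothed log-probability estimate (reading this off the code above, with mu the phantom-example weight, |D| = dic.len the dictionary size, and n_k the count for dictionary word k):
\[ \log \hat{p}(x = k \mid y) = \log(n_k + \mu) - \log\Big(\mu |D| + \sum_{k'} n_{k'}\Big) = \log \frac{n_k + \mu}{\mu |D| + \sum_{k'} n_{k'}}. \]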
Question 1. (30 points) Write a function to compute log p̂(y = ỹ | x^test) and discriminate
between ham and spam, naive.bayes <- function(log.pvec.spam, log.pvec.ham,
log.prior.spam, log.prior.ham, dtm.test). This takes the log probabilities for the dictionary
in a spam document (log.pvec.spam), a ham document (log.pvec.ham), the log priors
(log.prior.spam, log.prior.ham) and a document-term matrix for the testing corpus
(dtm.test). For this question, paste the function naive.bayes into your homework.
Question 2. (15 points) What are the top 10 most popular words in the entire corpus?
What are their counts? What are their counts in the ham training set and the spam training
set?
Question 3. (20 points) Set μ = 1/|D|, where |D| is the dictionary size, and use the training
set for training and the testing set for testing. What proportion is classified correctly?
Additionally, report the proportion of your: true positives (spam classified as spam divided
by the total amount of testing spam), true negatives (ham classified as ham divided by the
total amount of testing ham), false positives (ham classified as spam divided by the total
amount of testing ham) and false negatives (spam classified as ham divided by the total
amount of testing spam).
Question 4. (30 points) Naive Bayes has a tunable parameter, μ. We will use 5-fold
cross-validation to search the parameter set
\[ \mu \in \left\{ \frac{1}{100}\cdot\frac{1}{|D|},\; \frac{1}{10}\cdot\frac{1}{|D|},\; \frac{1}{|D|},\; 10\cdot\frac{1}{|D|},\; 100\cdot\frac{1}{|D|} \right\}. \]
Create three matrices to store the proportion correctly classified, the proportion of false
negatives, and the proportion of false positives. These should be 5 × 5 matrices (5 folds by
5 values of μ).
a. (10 points) For each value of μ, estimate the correct classification rate, the false negative
rate and the false positive rate using 5-fold cross-validation. Use ham.train[1:70],
spam.train[1:70], ham.train[71:140], spam.train[71:140], ..., ham.train[281:350],
spam.train[281:350] as the folds; these values maintain the same proportion of ham
and spam as the original training set. Summarize your results in three graphs.
b. (5 points) What seems to be the best value for μ? Why?
c. (10 points) For each value of μ, train on the full training set and test on the full testing set.
Summarize the correct classification rate, the false negative rate and the false positive
rate in three graphs. Does your answer from b. still seem the best value? Why or why
not?
d. (5 points) How close are the rates estimated from cross-validation to the true rates on the
testing set (give percentage error)? What could account for differences? Give one way
the differences between the cross-validation rate estimates and the rates on the testing
sets could be minimized.
Question 5. (30 points) Finally, we will use feature selection to remove features that are
irrelevant to classification. Instead of calculating the probability over the entire dictionary,
we will simply count the number of times each of the n most relevant features appears and
treat the set of features themselves as a dictionary.
a. (15 points) A common way to select features is by using the mutual information between
feature x_k and the class label y. Mutual information is expressed in terms of entropies,
\[ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \]
Show that
\[ I_k = \sum_{\tilde{y}=0}^{1} \left[ p(x^{\text{test}} = k \mid y = \tilde{y})\, p(y = \tilde{y}) \log \frac{p(x^{\text{test}} = k \mid y = \tilde{y})}{p(x^{\text{test}} = k)} + \left(1 - p(x^{\text{test}} = k \mid y = \tilde{y})\right) p(y = \tilde{y}) \log \frac{1 - p(x^{\text{test}} = k \mid y = \tilde{y})}{1 - p(x^{\text{test}} = k)} \right]. \]
b. (15 points) Compute the mutual information for all features; use this to select the
top n features as a dictionary. Set μ = 1/n. Compute the proportion classified correctly,
the proportion of false negatives, and the proportion of false positives for
n ∈ {200, 500, 1000, 2500, 5000, 10000}. Display the results in three graphs. What happens?
Why do you think this is?
