Final Project: Sentiment Analysis Using
the Movie Review Data
urls.csv contains the URLs of the review pages for 10,547 movies.
Set the random number generator seed to 2 (using set.seed(2)), sample 500 of these movies, and
scrape all reviews for the sampled movies from their review pages.
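A minimal sketch of the sampling and scraping steps. The toy `urls` data frame stands in for urls.csv (assumed to have a `url` column), and the CSS selector in the scraper is an assumption that should be checked against the live IMDb page markup.

```r
# Toy stand-in for urls.csv (assumed column name: url).
urls <- data.frame(
  url = sprintf("http://www.imdb.com/title/tt%07d/reviews", 1:10547),
  stringsAsFactors = FALSE
)

set.seed(2)                                  # fixed seed, as required
sampled <- urls[sample(nrow(urls), 500), , drop = FALSE]

# Hypothetical scraper for a single review page (not run here);
# the ".text.show-more__control" selector is an assumption.
scrape_reviews <- function(url) {
  page <- rvest::read_html(url)
  data.frame(
    text = rvest::html_text(rvest::html_nodes(page, ".text.show-more__control")),
    stringsAsFactors = FALSE
  )
}
```

Sampling without replacement guarantees 500 distinct movies; the scraper would then be applied to each sampled URL.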
Prepare a movie review dataset containing review text, numeric ratings, and ids of users who provide reviews
and ratings (please note that some reviews may have no ratings).
I will use the reviews of two movies to illustrate what I am looking for in your final write-ups for this task.
Thumbtanic: http://www.imdb.com/title/tt0234889/reviews
The values in the first column are used as indices of these reviews. The last column contains the text
of the reviews and is wrapped due to limited space. Note that the second review has a missing value in
the rating column, indicating that the user did not provide a numeric rating along with the textual
review. It is good practice to export the data to the local disk for reuse once you have the data
organized in the above format.
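One way to export and reload the organized data. The small `reviews` data frame here is a placeholder; yours comes from the scraping step, and the column names are assumptions for illustration.

```r
# Placeholder for the organized review data (column names are assumptions).
reviews <- data.frame(
  revid  = 1:2,
  user   = c("u1", "u2"),
  rating = c(8, NA),        # NA: the user gave no numeric rating
  text   = c("Great fun.", "Not for me."),
  stringsAsFactors = FALSE
)

saveRDS(reviews, "reviews.rds")     # full-fidelity R object on disk
reloaded <- readRDS("reviews.rds")  # reload in a later session, no re-scraping
```

`saveRDS()` preserves column types (including the NA rating) exactly, which a CSV round trip may not.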
In text preprocessing, we need to prepare a dataset of terms by stripping white space, converting uppercase
to lowercase, and removing numbers, punctuation, and English stopwords (use the stop_words dataset
in tidytext) from all the reviews in this collection.
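The preprocessing steps above can be sketched with tidytext. `unnest_tokens()` already lowercases, strips punctuation, and splits on white space; numbers and stopwords are then filtered out. The toy reviews are illustrative only.

```r
library(dplyr)
library(tidytext)

# Two toy reviews standing in for the scraped collection.
reviews <- tibble(revid = 1:2,
                  text = c("An AMAZING film, 10/10!",
                           "Dull and lifeless... 1 star."))

terms <- reviews %>%
  unnest_tokens(word, text) %>%          # tokenize: lowercase, no punctuation
  filter(!grepl("^[0-9]+$", word)) %>%   # drop pure numbers
  anti_join(stop_words, by = "word")     # drop English stopwords
```

The result has one row per remaining word occurrence, keyed by `revid`.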
For the purpose of sentiment analysis, we want to keep only the terms that belong to the BING lexicon for
characterizing reviews.
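Restricting to the BING lexicon is an inner join against `get_sentiments("bing")`, which also attaches each word's positive/negative label. The toy `terms` table here stands in for the tokenized output of the previous step.

```r
library(dplyr)
library(tidytext)

# Example tokens, one row per word occurrence (e.g. from unnest_tokens()).
terms <- tibble(revid = c(1, 1, 2, 2),
                word  = c("amazing", "film", "dull", "plot"))

# Keep only words found in the Bing lexicon; non-opinion words drop out.
bing_terms <- terms %>%
  inner_join(get_sentiments("bing"), by = "word")
```

Words like "film" and "plot" are not opinion words and disappear after the join, leaving only sentiment-bearing terms.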
Keep only the reviews whose predictions are opposite to the actual sentiments (i.e., predictions that lead to
misclassification).
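A sketch of the simple-difference prediction and the misclassification filter. The `predict` column is the sign of the positive-minus-negative word count; the mapping of ratings (1–10) to a normalized rating in [-1, 1] is an assumption here, as are the column names.

```r
# Toy per-review counts of Bing-positive and Bing-negative words.
per_rev <- data.frame(revid    = 1:3,
                      rating   = c(9, 2, 8),
                      positive = c(5, 1, 0),
                      negative = c(1, 6, 4))

per_rev$predict    <- sign(per_rev$positive - per_rev$negative)  # simple difference
per_rev$nor_rating <- (per_rev$rating - 5.5) / 4.5               # assumed scaling to [-1, 1]

# Misclassified: prediction opposite to the rating-based sentiment.
misclassified <- per_rev[sign(per_rev$nor_rating) != per_rev$predict, ]
```

In this toy data only review 3 (a high rating but more negative than positive words) is misclassified.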
Find the 5 movies that have the largest numbers of misclassified reviews based on the simple-difference
approach, and produce a facet plot with each facet displaying the top 10 reviews that have the largest
discrepancies between the predictions and their normalized ratings. The following figure is the facet plot
created for the 2 movies used in this example. Your facet plot should have a similar appearance to the figure
below.
(The data behind the facet plot has columns: title, rating, negative, positive, predict, nor_rating.)
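A ggplot2 sketch of such a facet plot, using toy data. `discrepancy` stands in for the gap between a review's prediction and its normalized rating; all names here are assumptions.

```r
library(ggplot2)
library(dplyr)

# Toy misclassified reviews for two movies.
set.seed(2)
d <- tibble(title = rep(c("Movie A", "Movie B"), each = 12),
            revid = 1:24,
            discrepancy = runif(24, 0.5, 2))

p <- d %>%
  group_by(title) %>%
  slice_max(discrepancy, n = 10) %>%     # top 10 per movie
  ungroup() %>%
  ggplot(aes(discrepancy, reorder(factor(revid), discrepancy))) +
  geom_col() +
  facet_wrap(~ title, scales = "free_y") +  # one facet per movie
  labs(x = "|prediction - normalized rating|", y = "review index")
p
```

`scales = "free_y"` lets each facet show its own review indices rather than a shared axis.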
Select one of the reviews displayed in your facet plot and retrieve its content as I do for review 95 here, and
explain why the simple-difference approach does not work for it.
Term Analysis Method
Let us move on to using supervised learning for sentiment analysis. Select 500 positive reviews (ratings
between 7 and 10) and 500 negative reviews (ratings between 1 and 4), and combine them into a training set.
For illustrative purposes, I select 8 positive reviews and 8 negative reviews here.
## [1] "Brutal , Shocking and Artistic.Very few films go under this description, Requiem for a dream isa prime example.I have nothing to say about this film, just make sure you watch this."
Note that the positive column contains the number of positive words in each review, while the negative
column contains the number of negative words in each review. We have 145 additional columns used to keep
the counts of those distinct terms matching the BING lexicon.
We would like to derive a coefficient for each word that captures the correlation between the rating and the
word's counts across the reviews (i.e., coefficient of a term = cor(rating, term)). We can use the sign of the
coefficient to label the sentiment of each term. The figure below compares the sentiment derived from this
coefficient with the BING sentiment.
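The coefficient computation can be sketched in base R: one `cor()` call per term column against the rating vector, with the sign giving the derived sentiment. The ratings and counts below are toy values.

```r
# Toy ratings and a term-count matrix (one row per review).
ratings <- c(9, 8, 10, 2, 1, 3)
counts  <- data.frame(amazing = c(2, 1, 3, 0, 0, 0),
                      dull    = c(0, 0, 0, 2, 3, 1))

# coefficient of a term = cor(rating, term counts across reviews)
coefs   <- sapply(counts, function(x) cor(ratings, x))
derived <- ifelse(coefs > 0, "positive", "negative")  # sign labels sentiment
```

Here "amazing" co-occurs with high ratings (positive coefficient) and "dull" with low ones (negative coefficient), matching their BING labels.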
Find the top 20 words that have the largest absolute values of this coefficient and produce a bar chart
similar to the one given below:
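A sketch of selecting the top 20 words by absolute coefficient and plotting them; the coefficient table here is synthetic (the real one has ~145 BING-matched words).

```r
library(ggplot2)
library(dplyr)

# Synthetic coefficient table standing in for the real 145-word one.
coef_tbl <- tibble(word = paste0("term", 1:40),
                   coefficient = seq(-0.4, 0.38, length.out = 40))

p <- coef_tbl %>%
  slice_max(abs(coefficient), n = 20) %>%          # largest |coefficient|
  ggplot(aes(coefficient, reorder(word, coefficient),
             fill = coefficient > 0)) +            # color by derived sentiment
  geom_col(show.legend = FALSE) +
  labs(y = NULL)
p
```

Ordering the y axis by the signed coefficient puts negative-sentiment words at the bottom and positive ones at the top.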
## # A tibble: 145 x 4
word coefficient derived_sentiment BING_sentiment
Subset words with coefficients greater than 0.05 in absolute value to define a new lexicon for sentiment
scoring. Keep only words that match the new lexicon and reuse the simple-difference approach to
derive the sentiment scores of the reviews (in the score column).
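Building the new lexicon is a simple subset on the absolute coefficient; the toy table and the `score` definition in the comment restate the simple-difference approach under the new lexicon.

```r
# Toy coefficient table (word, coefficient).
coef_tbl <- data.frame(word        = c("amazing", "dull", "film"),
                       coefficient = c(0.40, -0.30, 0.02))

# New lexicon: |coefficient| > 0.05, labelled by the coefficient's sign.
new_lexicon <- subset(coef_tbl, abs(coefficient) > 0.05)
new_lexicon$derived_sentiment <- ifelse(new_lexicon$coefficient > 0,
                                        "positive", "negative")

# score per review = (# new-lexicon positive words) - (# negative words)
```

Words with weak correlations ("film" here, at 0.02) drop out of the lexicon and no longer contribute to review scores.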
We can use the training set and tree-structured modeling to determine the cutoff
( install.packages("tree") and library(tree) ).
The following command allows for the printing of the model in the tree structure:
Note that with the cutoff set at -1, we have 0 classification error for the training set.
Use the derived lexicon and the cutoff derived from tree-structured modeling on the training set to
classify examples in the test set that consists of 100 positive reviews and 100 negative reviews unused
in the training stage (remember to report revid of these 200 reviews). Report the confusion matrix.
library(tree)

# Fit a classification tree predicting the class label from the score.
cutoff.tree <- tree(class ~ score, dscore.per.rev)
summary(cutoff.tree)

# Plot the fitted tree and label its splits.
plot(cutoff.tree)
text(cutoff.tree, pretty = 0)
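Applying the tree's cutoff to held-out reviews and tabulating the result can be sketched as follows. The test set, its column names, and the exact direction of the cutoff rule (score above -1 classified as positive) are assumptions for illustration.

```r
# Toy held-out test set: actual class and new-lexicon score per review.
test <- data.frame(revid = 1:6,
                   class = c("pos", "pos", "pos", "neg", "neg", "neg"),
                   score = c(3, 0, -2, -4, -1, 2))

# Classify with the cutoff from the tree (-1 in the write-up).
test$pred <- ifelse(test$score > -1, "pos", "neg")

# Confusion matrix: actual classes vs. predictions.
cm <- table(actual = test$class, predicted = test$pred)
cm
```

With your real 200-review test set, the same `table()` call produces the confusion matrix to report alongside the 200 revid values.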