AD699: Data Mining for Business Analytics
Fall 2018
Homework #3
Topic: Classification
Due by 11:59 p.m. on Thursday, 01NOV
Overall purpose: To provide you with hands-on exposure to the k-nearest neighbors and naive bayes
methodologies for model-building. The goal here is not necessarily to build the *best* model (though if you
wish to, you can iterate through various combinations of predictors to try), but instead to give you meaningful,
first-hand exposure to what these algorithms are and how they might be used.
Your resources: You have many available resources to help you prepare this assignment. Among them are:
two help sessions (this assignment spans two Thursdays, and on either date, you can find the Professor and
the TA in FLR-267 from 1-3 p.m.); the course textbook, which has example code for many of the tasks that
you might need to perform here; the AD699 Video Library, which can sometimes be of help (and to which
content can gladly be added upon request); your peers (by all means, share what you learn...but remember,
your seed value is unique and I can ‘spot-check’ code that looks like it’s been copied); and the Internet (which can
be a wonderfully helpful source of answers to questions that start with “How can I…”). Lastly, I am also
available via e-mail…if your question is of the “Why doesn’t my code work?” variety, expect a response from
me asking you to send me your script. (If you want to save that back-and-forth, just send your script along with
the first e-mail.)
Also keep in mind:
** The Missouri Rule. If you get a syntax error along the way? Describe what you’re trying to do with
that step.
** Be sure to check the bullet points in the assignment folder as you go.
Task #1: k-nearest neighbors
The goal of this task is to classify people based on the way that they would answer the question:
“Is it rude to recline your seat on a plane?”
Step 1: Download the file ‘flying-etiquette.csv’ from our class Blackboard site.
Step 2: Using the read.csv() function, bring this file into your R environment. Show the step(s) that
you used to do this.
Step 3: Data Preparation. Convert any blank values in your dataset to “NA.” Show the step(s) that
you used to do this. For any rows with “NA” answers to the question about whether it is rude
to recline your seat on a plane, delete those rows entirely. Show the step(s) that you used to do
this.
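As a rough sketch of Steps 2 and 3 (the column name for the recline question below is a placeholder -- use the actual column name from your own file):

```r
# Step 2: bring the file into R (assumes it sits in your working directory)
flying <- read.csv("flying-etiquette.csv", stringsAsFactors = FALSE)

# Step 3: recode blank strings as NA across every column
flying[flying == ""] <- NA

# Drop rows with NA in the recline question.
# 'Is.it.rude.to.recline' is a placeholder column name.
flying <- flying[!is.na(flying$Is.it.rude.to.recline), ]
```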
Step 4: Select any five predictors from the dataset to use for this classification task. There are not
necessarily any right or wrong answers here -- but for each predictor that you select, write one
sentence that explains why you selected it. There is no R code needed for this step. [You can
move quickly through this -- just choose five and move on...but in real-life, variable selection is
the most important step in creating a machine learning model].
Step 5: Create a fictional person, and assign that person characteristics for the five predictors that you
chose to use. There is no R code needed for this step.
Step 6: k-nn uses numerical predictors to classify something that’s a factor. For any predictors that
you chose that show numerical ranges, replace those ranges with an average value from within the
range. For predictors with open-ended ranges, replace with a numerical value of your choice.
For any predictors that are categorical, convert them to binary dummies (and if it’s a category
that has more than two possible values, be sure not to use fullRank in the dummyVars
function -- either leave it out, or set it to FALSE). Next, if any values for your predictors are set to
“NA” now, replace them with a numeric value (yes, you’re imputing here, you liar!). Show the
step(s) that you used to do this.
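One possible shape for the dummy-coding and imputation in Step 6 (the predictor names 'Gender' and 'Age' are placeholders for whichever columns you chose):

```r
library(caret)

# Convert a categorical predictor to binary dummies.
# Leaving fullRank at its default (FALSE) keeps one dummy per level.
dmy <- dummyVars(~ Gender, data = flying)
gender_dummies <- predict(dmy, newdata = flying)
flying <- cbind(flying, gender_dummies)

# Impute any remaining NAs in a numeric predictor with, e.g., the median
flying$Age[is.na(flying$Age)] <- median(flying$Age, na.rm = TRUE)
```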
Step 7: Using your assigned seed value (from Assignment 2), partition your data into training (60%)
and validation (40%) sets. Show the step(s) that you used to do this.
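A minimal partitioning sketch (the seed value 699 is a placeholder -- substitute your assigned seed from Assignment 2):

```r
set.seed(699)  # placeholder -- use your assigned seed value

# 60/40 split into training and validation sets
train_rows <- sample(nrow(flying), 0.6 * nrow(flying))
train_df <- flying[train_rows, ]
valid_df <- flying[-train_rows, ]
```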
Step 8: Make a dataframe that contains information for the five predictors for your test subject. Show
the step(s) that you used to do this.
Step 9: Normalize your data using the preProcess() function from the caret package. Use Table 7.2
from the book as a guide for this. Show the step(s) that you used to do this.
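The normalization might look like the sketch below, assuming `predictor_cols` holds the names of your five numeric predictors, `train_df`/`valid_df` are your partitions, and `new_person` is the one-row dataframe from Step 8 (all placeholder names):

```r
library(caret)

# Fit the normalization on the training set only, then apply the same
# transformation to training, validation, and the fictional person
norm_values <- preProcess(train_df[, predictor_cols],
                          method = c("center", "scale"))
train_norm  <- predict(norm_values, train_df[, predictor_cols])
valid_norm  <- predict(norm_values, valid_df[, predictor_cols])
person_norm <- predict(norm_values, new_person)
```

Fitting preProcess() on the training data alone keeps information from the validation set from leaking into the model.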
Step 10: Using the knn() function from the FNN package, determine a classification for your fictional
person, using a k-value of 7. Show the step(s) that you used to do this, along with the output
in the console.
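A sketch of the classification call, assuming the normalized objects from Step 9 and a placeholder outcome column named 'recline':

```r
library(FNN)

knn_pred <- knn(train = train_norm,
                test  = person_norm,
                cl    = train_df$recline,  # placeholder outcome column
                k     = 7)
knn_pred  # the predicted class for your fictional person
```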
Step 11: Use your validation set to help you determine an optimal k-value. Use Table 7.3 from the
textbook as a guide here. Show the step(s) that you used to do this, along with the output in
the console.
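One common approach, per Table 7.3: score the validation set at each candidate k and keep the k with the best accuracy. The range 1:14 and the column name 'recline' below are placeholders:

```r
library(FNN)

accuracy <- data.frame(k = 1:14, accuracy = NA)

for (k in 1:14) {
  pred <- knn(train = train_norm, test = valid_norm,
              cl = train_df$recline, k = k)
  accuracy$accuracy[k] <- mean(pred == valid_df$recline)
}
accuracy  # pick the k with the highest validation accuracy
```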
Step 12: Re-run your knn() function with this new k-value. What result did you obtain? Was it
different from the one you saw in Step 10? Show the step(s) that you used to do this, along
with the output in the console.
Totally, 100% optional -- play around with different combinations of features (predictors) to try to improve
upon your model.
Task #2: Naive Bayes
The goal of this task is to classify people based on the way that they would answer the question:
“How do you like your steak prepared?”
Step 1: Download the file ‘steak-risk-survey.csv’ from our class Blackboard site.
Step 2: Convert any blank cells in your dataframe to “NA”. Show the step(s) that you used to do this.
Step 3: Data preparation. Remove any rows containing NA values from your dataframe. Show the
step(s) that you used to do this.
Step 4: Using your seed value (the same one from Assignment #2), partition your data into training
(60%) and validation (40%) sets. Show the step(s) that you used to do this.
Step 5: Select five predictors from among all the options available.* For each one you chose,
write one sentence explaining why you think it might be valuable.
Step 6: Build a naive bayes model, with the response variable How.do.you.like.your.steak.prepared.
Show the step(s) that you used to do this.
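A sketch of the model-building call, assuming `train_df` is your training partition; the five predictor names below are placeholders for whichever columns you chose in Step 5:

```r
library(e1071)

# Make sure the response is a factor before fitting
train_df$How.do.you.like.your.steak.prepared <-
  as.factor(train_df$How.do.you.like.your.steak.prepared)

# Predictor names are placeholders -- substitute your five choices
nb_model <- naiveBayes(How.do.you.like.your.steak.prepared ~
                         Gender + Age + Smoke + Gamble + Skydiving,
                       data = train_df)
nb_model  # prints the prior and conditional probability tables
```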
Step 7: Show a confusion matrix that compares the performance of your model against the training
data, and another that shows its performance against the validation data (just use the
accuracy metric for this analysis). Show the step(s) that you used to do this, along with the
output in the console. How does your model compare with a naive approach for
classification?
Why is this not necessarily a fair comparison? Lastly, how did your training set’s performance
compare with your validation set’s performance?
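The comparison in Step 7 could be sketched as follows, assuming the `nb_model`, `train_df`, and `valid_df` objects from the earlier steps (all placeholder names):

```r
library(caret)

train_pred <- predict(nb_model, train_df)
valid_pred <- predict(nb_model, valid_df)

# Accuracy against the training and validation sets
confusionMatrix(train_pred,
                as.factor(train_df$How.do.you.like.your.steak.prepared))
confusionMatrix(valid_pred,
                as.factor(valid_df$How.do.you.like.your.steak.prepared))

# Naive benchmark: always predict the most common class in the training set
max(prop.table(table(train_df$How.do.you.like.your.steak.prepared)))
```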
* In a different setting, we might spend more time on feature selection. However, the purpose here is to expose
you to the naive bayes methodology for classification. Totally, 100% optional -- play around with different
combinations of features (predictors) to try to improve upon your model.