STAT-2450 Assignment 2
Name: *** , Student ID: B00***
Read first:
• Please upload your assignment to Brightspace before the due date: Sunday, 25 Feb
2018, 5 pm.
• Submit your assignment in PDF format and name it as A2_YourName_B00##.pdf
• R should be used exclusively and the R code must be provided in your assignment.
• Show the results clearly; the code must be executable and commented.
• Please remember that all of the work must be your own, and answers must be given in
your own words.
• Let me know if you run into any trouble.
Q1 - Simple Linear Regression
Make sure to use set.seed() with the last three digits of your student ID to create the
simulation data.
[1 point] (a) Using the runif() function, create a vector, x, containing 100 observations
drawn uniformly between 1 and 3. This represents a predictor, X.
[1 point] (b) Using the rnorm() function, create a vector of the error term, eps, containing
100 observations drawn from a normal distribution with mean 0 and variance 0.25.
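As a rough illustration of (a) and (b), and not a model answer, the sketch below assumes a placeholder seed of 123 (use the last three digits of your own student ID instead). The one point it is meant to show is that rnorm() takes a standard deviation, so a variance of 0.25 has to be passed as sd = sqrt(0.25).

```r
# Placeholder seed (assumption); replace with the last three digits of your ID
set.seed(123)

n <- 100

# Predictor X: 100 draws from Uniform(1, 3)
x <- runif(n, min = 1, max = 3)

# Error term: rnorm() is parameterized by sd, not variance,
# so variance 0.25 corresponds to sd = sqrt(0.25) = 0.5
eps <- rnorm(n, mean = 0, sd = sqrt(0.25))

length(x)
length(eps)
```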
[2 points] (c) Using x and eps, generate a vector y according to the model
. What is the length of the vector y? What are the values of $\beta_0$ and $\beta_1$ in this linear model?
[2 points] (d) Fit a least squares linear regression model to predict y using x. Comment on
the model obtained.
[3 points] (e) Create a scatterplot between x and y, and also add the population (true)
regression line in (c) using a solid line and the least squares line you fitted in (d) using a
red line on the plot. Use the legend() command to create an appropriate legend.
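The plotting step in (e) might be sketched as follows. The coefficients b0_true and b1_true are hypothetical placeholders chosen only so the code runs; they are not the values from the model in (c), and the seed is likewise an assumption.

```r
set.seed(123)                       # placeholder seed (assumption)
x   <- runif(100, min = 1, max = 3)
eps <- rnorm(100, mean = 0, sd = sqrt(0.25))

# Hypothetical coefficients, NOT the assignment's model; substitute the real ones
b0_true <- 2
b1_true <- 1
y <- b0_true + b1_true * x + eps

fit <- lm(y ~ x)                    # least squares fit

plot(x, y, main = "Simulated data with population and fitted lines")
abline(a = b0_true, b = b1_true, col = "black", lty = 1)  # population line
abline(fit, col = "red", lty = 1)                         # least squares line
legend("topleft",
       legend = c("Population line", "Least squares line"),
       col = c("black", "red"), lty = 1)
```

abline() accepts either an intercept and slope (a, b) or a fitted lm object, which is why the two lines are drawn with different argument styles.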
[3 points] (f) Repeat (a)–(e) after modifying the data generation process in such a way that
there is more noise in the data (change the variance of the error term, eps, from 0.25 to 4).
Compare the population regression line, the old least squares line in (d) and the new least
squares regression line. Describe your results.
[3 points] (g) What are the confidence intervals for $\beta_0$ and $\beta_1$ based on the original data set
and the noisier data set in (f)? How does the variance of the error term affect the CI? Comment
on your results.
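One possible way to set up the comparison in (g) is with confint(), sketched below. As before, b0_true and b1_true are hypothetical stand-ins for the coefficients of the model in (c), and the seed is a placeholder.

```r
set.seed(123)                       # placeholder seed (assumption)
x <- runif(100, min = 1, max = 3)

# Hypothetical coefficients, NOT the assignment's model
b0_true <- 2
b1_true <- 1

# Same predictor, two noise levels: variance 0.25 and variance 4
y_low  <- b0_true + b1_true * x + rnorm(100, mean = 0, sd = sqrt(0.25))
y_high <- b0_true + b1_true * x + rnorm(100, mean = 0, sd = sqrt(4))

# 95% confidence intervals for the intercept and slope of each fit
ci_low  <- confint(lm(y_low ~ x),  level = 0.95)
ci_high <- confint(lm(y_high ~ x), level = 0.95)

# Interval widths (upper bound minus lower bound), one per coefficient
width_low  <- ci_low[, 2]  - ci_low[, 1]
width_high <- ci_high[, 2] - ci_high[, 1]

ci_low
ci_high
```

Because both fits share the same x, any difference in interval width comes from the estimated residual standard error.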
Q2 - Multiple Linear Regression
This question uses the Boston data set in the MASS package. Use the medv column as the
response and the other variables as predictors. You can use ?Boston to see the details of this data set.
[2 points] (a) Create a new data set Boston2, which is a subset of Boston with the columns:
medv, zn, dis, lstat. How many rows and columns are in the Boston2 data set? Use only the
Boston2 data set for all of the following questions.
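A minimal sketch of the subsetting in (a), assuming the MASS package is installed (it ships with standard R distributions):

```r
library(MASS)   # provides the Boston data frame

# Keep only the response and the three predictors, selected by column name
Boston2 <- Boston[, c("medv", "zn", "dis", "lstat")]

dim(Boston2)    # number of rows, then number of columns
```

Selecting columns by name rather than by position keeps the code correct even if the column order of the source data changes.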
[1 point] (b) Describe briefly what these 4 variables chosen in our data set Boston2
represent. (hint: ?Boston)
[2 points] (c) For the predictor lstat, fit a simple linear regression model to predict the
response medv. Comment on the model obtained. Is there a statistically significant
relationship between the predictor and the response? Is the relationship positive or
negative?
[2 points] (d) Create the residual plot for the simple linear regression model fitted above
(x-axis: fitted y; y-axis: residuals). According to the residual plot, do you think the
relationship between lstat and response medv is linear or non-linear? Explain.
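The residual plot described in (d) could be drawn along these lines. The object name fit_lstat is an arbitrary choice for illustration.

```r
library(MASS)
Boston2 <- Boston[, c("medv", "zn", "dis", "lstat")]

# Simple linear regression of medv on lstat
fit_lstat <- lm(medv ~ lstat, data = Boston2)

# Residuals against fitted values
plot(fitted(fit_lstat), resid(fit_lstat),
     xlab = "Fitted medv", ylab = "Residuals",
     main = "Residual plot: medv ~ lstat")
abline(h = 0, lty = 2)   # reference line at zero
```

A patternless band around the dashed line is what a linear relationship would produce; a systematic curve suggests otherwise.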
[2 points] (e) For the predictor lstat, fit a model with polynomial terms:
. According to your results, is there evidence of non-
linear association between medv and lstat? Comment on your results.
[3 points] (f) Fit a multiple regression model using all 3 predictors. Describe your results.
For which predictors can we reject the null hypothesis $H_0: \beta_j = 0$?
[2 points] (g) Given a new data point (zn = 12, dis = 4, lstat = 15), predict the response
medv using the multiple regression model fitted in (f).
[3 points] (h) Calculate and compare the 95% confidence and prediction intervals for this
new data point.
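For (g) and (h), predict() on an lm object can return both interval types through its interval argument. The sketch below is one way to arrange this; the names fit_all, new_point, conf_int and pred_int are illustrative choices.

```r
library(MASS)
Boston2 <- Boston[, c("medv", "zn", "dis", "lstat")]

# Multiple regression on all three predictors
fit_all <- lm(medv ~ zn + dis + lstat, data = Boston2)

# New observation; column names must match the predictor names in the model
new_point <- data.frame(zn = 12, dis = 4, lstat = 15)

# 95% confidence interval for the mean response at new_point
conf_int <- predict(fit_all, newdata = new_point,
                    interval = "confidence", level = 0.95)

# 95% prediction interval for a single new response at new_point
pred_int <- predict(fit_all, newdata = new_point,
                    interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return a matrix with columns fit, lwr and upr. The fit column is the same in each; the prediction interval is wider because it also accounts for the error of an individual observation.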
Q3 - KNN classifier
Use the Boston2 data set to do the following questions (response: medv; predictors: zn, dis,
lstat).
[3 points] (a) Describe the procedure of the KNN algorithm briefly in your own words.
[1 point] (b) Create a function that can normalize the input data by
. (hint:
input should be a vector)
[2 points] (c) Use the function created in (b) to normalize the predictor columns zn, dis, lstat.
[1 point] (d) Categorize the column medv so that it contains "high" if medv > 20 and "low" if
medv <= 20.
[1 point] (e) Show the summaries of this transformed data set. (hint: summary())
[2 points] (f) Randomly split the data set into training data (80%) and test data (20%). (hint:
set a random seed before splitting)
[3 points] (g) Use the training data set to fit a knn classifier (k = 5) and predict medv on
the test data. How many observations in the test data do you predict correctly? Calculate the test
misclassification rate. (hint: knn() function from the class library)
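Steps (c) through (g) might be chained together roughly as below. Two things here are assumptions and not part of the assignment text: the helper rescale01(), which uses min-max rescaling as a stand-in for whatever normalization (b) specifies, and the seed value 1.

```r
library(MASS)
library(class)   # provides knn()

Boston2 <- Boston[, c("medv", "zn", "dis", "lstat")]

# ASSUMPTION: min-max rescaling stands in for the normalization from (b)
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
X <- as.data.frame(lapply(Boston2[, c("zn", "dis", "lstat")], rescale01))

# Class label: "high" if medv > 20, otherwise "low"
label <- factor(ifelse(Boston2$medv > 20, "high", "low"))

# 80/20 random split; the seed value is a placeholder
set.seed(1)
train_idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))

# knn() classifies each test row by majority vote among its k nearest training rows
pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
            cl = label[train_idx], k = 5)

n_correct  <- sum(pred == label[-train_idx])    # correctly classified test rows
error_rate <- mean(pred != label[-train_idx])   # test misclassification rate

table(predicted = pred, actual = label[-train_idx])
n_correct
error_rate
```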
[5 points] (h) Perform this knn classification using different values of k (k = 1, 5, 10, 20) and
calculate the corresponding test misclassification rates. According to the test error rate,
which k would you choose? Explain.
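The comparison across k in (h) can be written as a loop over the candidate values, for instance with sapply(). As in any sketch of this pipeline, the min-max helper rescale01() and the seed are assumptions, not the assignment's specification.

```r
library(MASS)
library(class)

Boston2 <- Boston[, c("medv", "zn", "dis", "lstat")]

# ASSUMPTION: min-max rescaling stands in for the normalization from (b)
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
X <- as.data.frame(lapply(Boston2[, c("zn", "dis", "lstat")], rescale01))
label <- factor(ifelse(Boston2$medv > 20, "high", "low"))

set.seed(1)      # placeholder seed; also fixes knn()'s random tie-breaking
train_idx <- sample(nrow(X), size = floor(0.8 * nrow(X)))

ks <- c(1, 5, 10, 20)

# Test misclassification rate for each k, on the same train/test split
test_error <- sapply(ks, function(k) {
  pred <- knn(train = X[train_idx, ], test = X[-train_idx, ],
              cl = label[train_idx], k = k)
  mean(pred != label[-train_idx])
})
names(test_error) <- paste0("k=", ks)

test_error
```

Holding the split fixed across all values of k means the error rates differ only because of k, not because of which rows landed in the test set.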