Homework 4 Part 1
APANPS4335: Machine Learning
Due March 20 by 11:55pm (Total: 75 points)
Directions: please submit your homework as two files — .Rmd and .pdf — on the Canvas class website.
Include short explanations of your code and results throughout.
1 Subset Selection and Shrinkage Methods
Answer true or false for each statement and explain why.
A. In the past, performing cross-validation was often computationally prohibitive for problems that had
many predictors and/or a large number of observations. AIC, BIC, adjusted R2 or Cp were often used
instead.
B. Ridge Regression will shrink all coefficients toward zero, but it will not set any of them exactly equal to
zero.
C. If the number of parameters (p) in a dataset exceeds the number of observations (n) (that is, p > n),
then you cannot use ordinary least squares to fit the data.
D. Ridge regression models are usually more “interpretable” than Lasso regression models.
E. In logistic regression we can also use subset selection ordering by RSS (residual sum of squares) as we
do with linear regression.
F. Backward stepwise selection is an appealing alternative for feature selection, but it has computational
limitations.
G. LASSO, unlike Ridge Regression, can also be used for model selection.
H. The idea behind adjusted R2 is that once all the correct variables have been included in the model,
adding noise variables will reduce RSS only slightly. Adjusted R2 penalizes a model for excessive
variables.
I. Ridge regression works best in situations where the least squares estimates have high variance.
J. When lambda becomes very large, LASSO gives the least squares fit.
2 Best Subset Selection:
Several methods can be used for selecting subsets of predictors. Forward stepwise selection, backward stepwise
selection, and best subset selection are among the best known.
Here we are going to look at best subset selection. This computes the ‘best’ model for each model size: the best five coefficients for a five-variable model, the best six coefficients for a six-variable model, and so on.
Here we’ll work with some simulated data. We have 30 parameters and 1000 observations. You can copy and paste the code below into an .Rmd code chunk.
#HW4.1
set.seed(2)
# p = parameters = number of predictors
p = 30
# n = number of observations
n = 1000
#input
x = matrix(rnorm(n*p), nrow=n, ncol = p)
# Creating our coefficients
betas<-rnorm(p)
# We set 10 of our coefficients equal to zero
null.betas<-c(3,7,9,13,19,23,26,27,28,29)
betas[null.betas] <- 0
# Our betas
betas
## [1] 1.7968959 -1.1212405 0.0000000 -1.2278183 1.9297196 -0.2735910
## [7] 0.0000000 -0.8399482 0.0000000 0.9907501 1.1973154 1.8470455
## [13] 0.0000000 -0.5553215 -0.8247737 -0.1866437 -1.0133564 -0.9260824
## [19] 0.0000000 1.7593411 1.3552145 0.8680076 0.0000000 -0.4286713
## [25] 2.1509114 0.0000000 0.0000000 0.0000000 0.0000000 -0.9235582
# We add some noise to our data
err = rnorm(n)
# y our response variable:
y <- x %*% betas + err
# Split data into training and test sets
train = sample(seq(1000), 200, replace = FALSE)
y.train = y[train, ]
y.test = y[-train, ]
x.train = x[train, ]
x.test = x[-train, ]
You’ll notice in the list of betas that 10 are zeros.
Using regsubsets() from R’s “leaps” package, compute “best subsets” on your simulated training data (use x.train and y.train; you’ll find it useful to read Lab 6.5.1 in ISL for this section). This code will get you going:
library(leaps)
## Warning: package 'leaps' was built under R version 3.3.2
regfit.full = regsubsets(y ~ ., data = data.frame(x = x.train, y = y.train), nvmax = p)
# computing mse for each p value
train.errors = rep(NA, p)
# Providing names to variables for later identification
x_cols = colnames(x, do.NULL = FALSE, prefix = "x.")
# computing 'best' coefficients for 30 different model sizes
# i.e., best coefficients for a model with one predictor, model
# with two variables, model with three variables....30 in all.
for (i in 1:p) {
  coefi = coef(regfit.full, id = i)
  # include the intercept so predictions are on the right scale
  pred = coefi["(Intercept)"] +
    as.matrix(x.train[, x_cols %in% names(coefi)]) %*% coefi[names(coefi) %in% x_cols]
  train.errors[i] = mean((y.train - pred)^2)
}
2.1 Best size (p) (training)
Which model size (p) gave you the lowest mean squared error (mse) on training data? That is, what model size was ‘best of the best’?
2.2 Best MSE (training)
What was the mse for the ‘best of the best’ on training data?
2.3 Plot (training)
Create a plot where your x axis is the number of variables (p) in the model, and the y axis is the MSE on
training data. You can use R’s plot() function, or ggplot if you like. Identify with a red point where mse was
minimized.
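As a sketch of what this plot might look like in base R (the MSE values below are made-up placeholders; substitute your own train.errors vector):

```r
# Hypothetical MSE values for illustration only; use your train.errors vector
train.errors <- c(9.1, 6.4, 4.2, 2.9, 2.1, 1.8, 1.7, 1.65)
plot(seq_along(train.errors), train.errors, type = "b",
     xlab = "Number of variables (p)", ylab = "Training MSE")
best <- which.min(train.errors)
points(best, train.errors[best], col = "red", pch = 19)  # mark the minimum
```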
2.4 Coefficients: training
List the coefficients of the variables selected for your “best” overall model.
2.5 Null betas: training
How many of the variables that had zero betas in our (true) simulated data made it into your ‘best of the
best’ subsets model immediately above?
You’ll recall that these were the variables with null coefficients: null.betas<-c(3,7,9,13,19,23,26,27,28,29). Did
they make the cut? That is, were any of them included in the ‘best of the best’ model above?
2.6 Best size (p): test data
Now we do the same with test data. Which model size (p) generated the lowest mean squared error (mse) on
test data?
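A self-contained sketch of the test-side computation follows; in your homework you would reuse regfit.full, x.test, y.test, and x_cols from the chunks above rather than re-simulating.

```r
library(leaps)
# Re-simulate the data (in the homework, reuse the objects created earlier)
set.seed(2)
p <- 30; n <- 1000
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
betas <- rnorm(p)
betas[c(3, 7, 9, 13, 19, 23, 26, 27, 28, 29)] <- 0
y <- x %*% betas + rnorm(n)
train <- sample(seq(n), 200)
x_cols <- colnames(x, do.NULL = FALSE, prefix = "x.")
regfit.full <- regsubsets(y ~ ., data = data.frame(x = x[train, ], y = y[train, ]),
                          nvmax = p)
# test MSE for each model size
val.errors <- rep(NA, p)
for (i in 1:p) {
  coefi <- coef(regfit.full, id = i)
  pred <- coefi["(Intercept)"] +
    as.matrix(x[-train, x_cols %in% names(coefi)]) %*% coefi[names(coefi) %in% x_cols]
  val.errors[i] <- mean((y[-train, ] - pred)^2)
}
which.min(val.errors)  # model size with the lowest test MSE
```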
2.7 Best MSE (test data)
Compute the mse for the ‘best of the best.’
2.8 Plot (test data)
Create a plot where your x axis is the number of variables (p) in the model, and the y axis is the MSE using
test data.
2.9 Coefficients (test data)
List the coefficients of the variables selected for the “best” overall model on test data.
2.10 Null betas (test data)
How many of the variables with zero betas in our (true) simulated data were chosen in your ‘best of the best’
subsets model immediately above? You’ll recall that these were the variables with null coefficients:
null.betas<-c(3,7,9,13,19,23,26,27,28,29)
2.11 Conclusions
From your computations, lists and plots can you make any generalizations about training data versus test
data with regard to model selection?
3 Gene Expression
Gene expression is the conversion of information encoded in a gene into protein or RNA structures. Here we will use gene expression data from prostate cancer tissue to classify whether a specimen is cancerous or not. The covariates are 2135 gene microarray expression levels for 102 patients. The response is “cancer” or “noncancer”.
The cleaned dataset “prostate_preprocessed.txt” is in the Homework 4 folder. Place the dataset in your working directory and use the script in prostate_Microarray.pdf (also in the Homework 4 folder) to load the data. Note: the rows and columns are transposed from the usual observations-by-covariates layout.
In this section we are going to fit Ridge Regression and Lasso models to the prostate dataset.
With a dataset this large, leave-one-out cross validation (LOOCV) is often too time consuming, so we will use 10-fold cross validation, primarily to select λ, the tuning parameter used in the regularization term for both Ridge and Lasso.
We’ll begin with Ridge Regression. Use R’s cv.glmnet() function from the glmnet package to fit a ridge
estimator. (You can refer to Lab 6.6 in ISL)
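A starting sketch, using simulated stand-in data (the real call would take the loaded 102 × 2135 expression matrix and the cancer/noncancer factor; x.gene and y.cancer are placeholder names):

```r
library(glmnet)
# Stand-in data for illustration; substitute the loaded prostate matrix
# (patients in rows, genes in columns) and the cancer/noncancer labels
set.seed(1)
x.gene <- matrix(rnorm(102 * 200), nrow = 102)
y.cancer <- factor(rbinom(102, 1, plogis(x.gene[, 1] - x.gene[, 2])))
# 10-fold CV for the ridge-penalized (alpha = 0) logistic model
cv.ridge <- cv.glmnet(x.gene, y.cancer, alpha = 0, family = "binomial", nfolds = 10)
cv.ridge$lambda.min  # the lambda selected by cross validation
```

The same call with alpha = 1 fits the lasso estimator asked for below.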
3.1 Best λ (Ridge)
Record the selected value for λ, the tuning parameter, after cross validation.
3.2 Histogram (Ridge)
Plot a histogram of the values for β_ridge. How many of them are not equal to 0?
3.3 Non-zero coefficients (Ridge)
How many non-zero coefficients do you have?
Now use cv.glmnet() to fit a lasso estimator:
3.4 Best λ (Lasso)
Record the selected value for λ after cross validation.
3.5 Histogram (Lasso)
Plot a histogram of the values for β_lasso.
3.6 Non-zero coefficients (Lasso)
How many of the betas are not equal to 0?
Now use these models to generate the percentage misclassified on the testing and training sets.
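One way to sketch the misclassification computation, again on stand-in data (replace x.all, y.all, and tr, which are placeholder names, with your prostate matrix, labels, and training index):

```r
library(glmnet)
set.seed(1)
# Stand-in data; use the prostate expression matrix and labels instead
x.all <- matrix(rnorm(102 * 200), nrow = 102)
y.all <- factor(rbinom(102, 1, plogis(x.all[, 1] - x.all[, 2])))
tr <- sample(102, 70)
cv.fit <- cv.glmnet(x.all[tr, ], y.all[tr], alpha = 0, family = "binomial", nfolds = 10)
# type = "class" returns predicted labels at the chosen lambda
pred.tr <- predict(cv.fit, newx = x.all[tr, ], s = "lambda.min", type = "class")
pred.te <- predict(cv.fit, newx = x.all[-tr, ], s = "lambda.min", type = "class")
mean(pred.tr != y.all[tr])   # training misclassification rate
mean(pred.te != y.all[-tr])  # test misclassification rate
```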
3.7 Misclassification: Ridge (training data)
Record the misclassification values for Ridge on training data.
3.8 Misclassification: Ridge (test data)
Record the misclassification values for Ridge on test data.
3.9 Misclassification: Lasso (training data)
Record the Lasso training misclassification rate.
3.10 Misclassification: Lasso (test data)
Record the Lasso test misclassification rate.
3.11 Best predictor: Ridge v. Lasso
Which is a better predictor? Why?
3.12 Best of Four
When would you want to use:
• Ridge regression
• Lasso regression
• Ordinary least squares linear regression
• k nearest neighbors
Consider both predictive accuracy and interpretability.
4 Decision Trees
Tree-based methods are relatively simple and easy to interpret. They are also versatile and can be used for both regression and classification. They involve stratifying or segmenting the predictor space into a number of simple regions.
In this section we’ll use the diabetes.data.txt data found in the Homework 4 folder in Canvas. For each of
the 442 patients in this dataset, 10 baseline variables were measured: age, sex, body mass index, average
blood pressure and 6 blood serum measurements. The response, Y, is a quantitative measure of the disease
one year after the baseline measurements were taken. The goal is to predict this measure given the baseline
data.
Load the diabetes dataset.
4.1 Split the data
Split the data into a training set and a test set.
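A minimal sketch, using a stand-in data frame in place of the loaded diabetes data:

```r
# Stand-in for the loaded data; replace with the real diabetes data frame
diabetes <- data.frame(matrix(rnorm(442 * 11), nrow = 442))
names(diabetes)[11] <- "Y"
set.seed(1)
train <- sample(nrow(diabetes), nrow(diabetes) / 2)
diabetes.train <- diabetes[train, ]
diabetes.test <- diabetes[-train, ]
```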
4.2 Fit regression tree
Fit a regression tree to the training set. You can use the rpart() function in R’s “rpart” package. Plot the
tree and interpret the results. Which variables were used in tree construction? How many terminal nodes do
you have?
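A sketch with rpart on stand-in data (two made-up predictors, bmi and bp; your fit would use the diabetes training split and response Y):

```r
library(rpart)
# Stand-in data with built-in signal so the tree actually splits;
# replace with your diabetes training split
set.seed(1)
d <- data.frame(bmi = rnorm(300), bp = rnorm(300))
d$Y <- 10 * (d$bmi > 0) + 5 * d$bp + rnorm(300)
fit.tree <- rpart(Y ~ ., data = d)
plot(fit.tree); text(fit.tree)        # draw the fitted tree
sum(fit.tree$frame$var == "<leaf>")   # number of terminal nodes
```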
4.3 RMSE (training)
Compute root mean squared error (RMSE) on training data.
4.4 RMSE (test)
Compute root mean squared error (RMSE) on test data.
4.5 Prune tree
Use cross validation to determine the optimal level of tree complexity, then prune your tree. You can use the printcp() and prune() functions in R’s “rpart” package (the cptable reports cross-validated error at each complexity level). Plot the pruned tree. Which variables did you use in your pruned tree? How many terminal nodes are there now?
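A pruning sketch on the same kind of stand-in tree; the idea is to pick the complexity parameter (cp) whose cross-validated error is smallest:

```r
library(rpart)
set.seed(1)
# Stand-in tree; in the homework, prune the tree fitted to the diabetes data
d <- data.frame(bmi = rnorm(300), bp = rnorm(300))
d$Y <- 10 * (d$bmi > 0) + 5 * d$bp + rnorm(300)
fit.tree <- rpart(Y ~ ., data = d)
# cp value whose cross-validated error (xerror) is smallest
best.cp <- fit.tree$cptable[which.min(fit.tree$cptable[, "xerror"]), "CP"]
pruned <- prune(fit.tree, cp = best.cp)
sum(pruned$frame$var == "<leaf>")  # terminal nodes after pruning
```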
4.6 Root MSE (pruned tree)
Compute the root mean squared error (RMSE) for your pruned tree using test data.
4.7 Random Forests
Analyze the data using random forests. Install R’s “randomForest” package. Set a seed. Fit random forests
to the diabetes data. (See page 329 in ISL.) Compute RMSE.
Please note: a key setting in random forests is the M tuning parameter (“mtry” in R). M is the number of variables randomly sampled as candidates at each split. Its default in R is sqrt(p) for classification and p/3 for regression, where p is the number of predictor variables in the dataset.
Try several different M settings. Does RMSE change? Which M setting gives you your lowest RMSE?
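One way to compare settings is a simple loop over mtry, sketched here on stand-in data (replace d and tr, which are placeholder names, with your diabetes data and training index):

```r
library(randomForest)
set.seed(1)
# Stand-in data: 10 predictors, response driven by the first three
d <- data.frame(matrix(rnorm(300 * 10), nrow = 300))
d$Y <- rowSums(d[, 1:3]) + rnorm(300)
tr <- sample(300, 200)
rmse <- rep(NA, 10)
for (m in 1:10) {
  rf <- randomForest(Y ~ ., data = d[tr, ], mtry = m)
  pred <- predict(rf, d[-tr, ])
  rmse[m] <- sqrt(mean((d[-tr, "Y"] - pred)^2))
}
which.min(rmse)  # the mtry setting with the lowest test RMSE
```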
4.8 Most important variables
What are the three most important variables for predicting diabetes with your random forests model? (You can use the importance() function in the randomForest package.)