STAT:3200 { Applied Linear Regression, Homework 9
Assigned by Dr. Erning Li on 11/16/2018.
Due at the beginning of class on Friday, November 30, 2018.
Coverage: Multicollinearity, Model selection.
1. A simulation study of the impacts of multicollinearity:
In statistics, Monte Carlo simulation can be used to assess the performance of a method. With
simulations, we know and control the truth. In this simulation, you repeatedly draw independent
samples from an articial (Y; x1; x2; x3) population with known parameters, conduct estimation
and inference on model parameters and empirically check the impacts of multicollinearity.
The response Y is generated from the following model (true model):
Y = 0 + 1x1 + 2x2 + ; where 0 = 150; 1 = 10; 2 = 2; iid N(0; 152):
Hence in the population, x1 and x2 are useful for predicting Y , while x3 isn't (but a surrogate of
x2). The variables x2 and x3 have high correlation coecient 0.95, but x1 is independent from
them. Each simulated sample is t to two regression models:
Model I: lm(Y x1 + x2)
Model II: lm(Y x1 + x2 + x3)
Read the R script le at http://www.stat.uiowa.edu/~ernli/ALRdata/HW9_simulation.R
to see the data generating scheme and multiple linear regressions. Open R and copy/paste this
R script into your R Editor.
Step a): Run the entire script to obtain one simulated random sample and model ts. Copy/paste
the estimate, s.e., t stat, p-value for b1 and b2 from Model I t, and those for b1, b2 and
b3 from Model II t into your answer sheet.
Step b): Repeatedly simulate samples from the same population by repeating the last step (Step a))
for a total of 5 times.
Use your simulation results to answer:
(1) For the ts of Model I,
i) Are b1 and b2 stable and close to their corresponding true parameter values?
ii) Do b1 and b2 have signicant p-values?
(2) For the ts of Model II,
i) Are b1 stable and close to its true parameter value?
ii) Do b1 have signicant p-values?
iii) Are b2 stable and close to its true parameter value?
iv) Do b2 (always) have signicant p-values?
v) How are the standard errors of b2 compared to those from Model I (larger or smaller)?
vi) Are b3 stable and close to its true parameter value?
1
vii) Do b3 (always) have signicant p-values?
(3) Are the two regression models (I and II) reliable for analyzing the data? Brie
y explain.
2. Consider the \Bfox" data set in car library, which contains 30 time series data on Canadian
women's labor-force participation in the rst three decades of the postwar period.
The researchers were expecting: women's earnings (womwage), consumer debt (debt), and the
availability of part-time work (parttime) to aect women's labor-force participation (partic)
positively; while fertility (tfr) and men's earnings (menwage) to have negative eects.
Because all of the series, including that for the dependent variable, manifest strong linear trends
over the 20-year period of the study, we want to account for a linear time eect. Thus, we create
our own time variable for inclusion into the model using the row names of the data which are
the years:
> attach(Bfox)
> time = as.numeric(rownames(Bfox))
> Bfox.new = cbind(Bfox, time)
> names(Bfox.new)
[1] "partic" "tfr" "menwage" "womwage" "debt" "parttime" "time"
(a) Fit the multiple regression model for predicting women's labor participation (partic) from
all the other variables (predictors).
Based on the signs of the regression coecient estimates, the eect of which predictor is
not consistent with what the researchers were expecting?
Are there any partial regression coecients that are insignicant? List them.
## when regressing one response variable on all the other variables as main
## effects (i.e. no interactions), you can use the following short-cut code
> lm.out = lm(partic ~ ., data = Bfox.new)
(b) Plot the scatterplot matrix and check pairwise correlations.
Are some of the correlations very high?
> plot(Bfox.new, pch=16)
> round(cor(Bfox.new), 4) ## round to 4 decimals
(c) 1) Compute the Variance In
ation Factors for the predictors in this model.
Do you have a concern? Brie
y explain.
> vif(lm.out)
2) Manually calculate the VIF for menwage using the R2 from regressing menwage on all
the other predictors.
> summary( lm(menwage ~ tfr + womwage + debt + parttime + time,
data=Bfox.new) )$r.squared
(d) For the above multiple regression model (lm.out), use AIC in a backward stepwise method
to select an \optimal" model. (No need to include the AIC iterations.)
2
Provide a list of the predictors that were removed in the order from rst to last.
Provide a list of predictors that remain in the nal chosen model.
> model.selection = step(lm.out)
(e) Use BIC in the model selection procedure to choose an \optimal" model. (No need to
include the BIC iterations.)
Provide a list of the predictors that were removed in the order from rst to last.
Is the nal model preferred by BIC the same as the one chosen by AIC?
> model.selection.BIC = step(lm.out, k=log(nrow(Bfox.new)))
(f) For the nal model chosen by AIC/BIC, do the following:
1) Fit the nal model. Provide summary output.
Are the partial regression coecients signicant?
2) Compare AIC values between the original model (lm.out) and the nal model.
Which model is better according to AIC?
## check the AIC value of the original model
> AIC(lm.out)
3) Compare BIC values between the original model (lm.out) and the nal model.
Which model is better according to BIC?
## check the BIC of the original model
> BIC(lm.out)
## alternatively
> AIC(lm.out, k=log(nrow(Bfox.new)))
4) Compare Adjusted R2 between the original model (lm.out) and the nal model.
Which model is better according to Adjusted R2?
5) Conduct a partial F test to see if the nal model is sucient in comparison to the original
model (lm.out)? Hint: anova().
6) For this nal model, is there still a concern of multicollinearity? Compute VIF's and
explain.
7) For this nal chosen model, plot Cook's D against observation indices.
Plot the \bubble" plot of studentized residuals against hat values with bubble size
increasing with Cook's D. (Note: Use correct sample size and number of predictors
when determining the appropriate thresholds in your R codes.)
Which observation is unusually most in
uential | Is it most in
uential due to high
leverage and/or large residual?
(g) Compare the following two models:
> model.1 = lm(partic ~ tfr + debt + parttime)
> model.2 = lm(partic ~ menwage + debt + time)
Which model is preferred by AIC? Which model is preferred by BIC?
(h) 1) Regress partic on debt, parttime, and their interaction debt:parttime. Compute the VIF's
and explain if there is a concern of multicollinearity.
3
> fit1 = lm(partic ~ debt + parttime + debt:parttime)
2) Regress partic on centered debt, centered parttime, and the interaction of the two centered
predictors. Compute the VIF's and explain if there is a concern of multicollinearity.
## center debt
> c.debt = debt - mean(debt)
3. In the last assignment, we considered the data \Chirot" in the car library and t the model
> myfit = lm(intensity ~ commerce + tradition + commerce:tradition
+ midpeasant + inequality, data=Chirot)
(a) Compute the VIF's for the predictors in this model.
Do you have a concern? Brie
y explain.
(b) Use AIC to select an \optimal" model. (No need to include the AIC iterations.)
Provide a list of predictors that remain in the AIC-chosen model.
Note: When there are higher-order terms in a model, in the iterations, AIC or BIC
check/report the highest-order term of each predictor rst. For example, midpeasant,
inequality, commerce:tradition (but not commerce or tradition). The reason is if the inter-
action `commerce:tradition' needs to be kept in the model, then the corresponding main
eects `commerce' and `tradition' should automatically remain in the model to ensure model
hierarchy.
(c) Use BIC to select an \optimal" model. (No need to include the BIC iterations.)
Provide a list of predictors that remain in the BIC-chosen model.
Which model is smaller, the one chosen by AIC or the one chosen by BIC?
(d) Provide the Adjusted R2 of the original model,of the AIC-chosen model, and of the BIC-
chosen model, respectively.
Which model is better according to Adjusted R2?
(e) For the AIC-chosen model, do the following:
1) Check the VIF's.
Which predictors have high VIF's?
Do you still have a concern? Brie
y explain.
2) Fit the model when both commerce and tradition are centered.
Compute the VIF's and explain if there is still a concern of multicollinearity.
Note: Provide necessary R code(s), output(s) and/or plot(s) to support your answers to each problem.
Please write up your answers clearly with all the key and necessary steps shown; you may lose points
for skipping/missing key codes, output, or steps. The homework you turn in should be neat and
stapled with your name on the top of each page. You can discuss and work on homework together,
but must do your own programming and write up your own answers. NO blind copying allowed.
The grader may deduct point if you didn't STAPLE your homework.