首页 > > 详细

STAT:3200 Applied Linear Regression

STAT:3200 { Applied Linear Regression, Homework 9
Assigned by Dr. Erning Li on 11/16/2018.
Due at the beginning of class on Friday, November 30, 2018.
Coverage: Multicollinearity, Model selection.
1. A simulation study of the impacts of multicollinearity:
In statistics, Monte Carlo simulation can be used to assess the performance of a method. With
simulations, we know and control the truth. In this simulation, you repeatedly draw independent
samples from an arti cial (Y; x1; x2; x3) population with known parameters, conduct estimation
and inference on model parameters and empirically check the impacts of multicollinearity.
The response Y is generated from the following model (true model):
Y = 0 + 1x1 + 2x2 + ; where 0 = 150; 1 = 􀀀10; 2 = 2;  iid  N(0; 152):
Hence in the population, x1 and x2 are useful for predicting Y , while x3 isn't (but a surrogate of
x2). The variables x2 and x3 have high correlation coecient 0.95, but x1 is independent from
them. Each simulated sample is t to two regression models:
 Model I: lm(Y  x1 + x2)
 Model II: lm(Y  x1 + x2 + x3)
Read the R script le at http://www.stat.uiowa.edu/~ernli/ALRdata/HW9_simulation.R
to see the data generating scheme and multiple linear regressions. Open R and copy/paste this
R script into your R Editor.
Step a): Run the entire script to obtain one simulated random sample and model ts. Copy/paste
the estimate, s.e., t stat, p-value for b 1 and b 2 from Model I t, and those for b 1, b 2 and
b 3 from Model II t into your answer sheet.
Step b): Repeatedly simulate samples from the same population by repeating the last step (Step a))
for a total of 5 times.
Use your simulation results to answer:
(1) For the ts of Model I,
i) Are b 1 and b 2 stable and close to their corresponding true parameter values?
ii) Do b 1 and b 2 have signi cant p-values?
(2) For the ts of Model II,
i) Are b 1 stable and close to its true parameter value?
ii) Do b 1 have signi cant p-values?
iii) Are b 2 stable and close to its true parameter value?
iv) Do b 2 (always) have signi cant p-values?
v) How are the standard errors of b 2 compared to those from Model I (larger or smaller)?
vi) Are b 3 stable and close to its true parameter value?
1
vii) Do b 3 (always) have signi cant p-values?
(3) Are the two regression models (I and II) reliable for analyzing the data? Brie
y explain.
2. Consider the \Bfox" data set in car library, which contains 30 time series data on Canadian
women's labor-force participation in the rst three decades of the postwar period.
The researchers were expecting: women's earnings (womwage), consumer debt (debt), and the
availability of part-time work (parttime) to a ect women's labor-force participation (partic)
positively; while fertility (tfr) and men's earnings (menwage) to have negative e ects.
Because all of the series, including that for the dependent variable, manifest strong linear trends
over the 20-year period of the study, we want to account for a linear time e ect. Thus, we create
our own time variable for inclusion into the model using the row names of the data which are
the years:
> attach(Bfox)
> time = as.numeric(rownames(Bfox))
> Bfox.new = cbind(Bfox, time)
> names(Bfox.new)
[1] "partic" "tfr" "menwage" "womwage" "debt" "parttime" "time"
(a) Fit the multiple regression model for predicting women's labor participation (partic) from
all the other variables (predictors).
Based on the signs of the regression coecient estimates, the e ect of which predictor is
not consistent with what the researchers were expecting?
Are there any partial regression coecients that are insigni cant? List them.
## when regressing one response variable on all the other variables as main
## effects (i.e. no interactions), you can use the following short-cut code
> lm.out = lm(partic ~ ., data = Bfox.new)
(b) Plot the scatterplot matrix and check pairwise correlations.
Are some of the correlations very high?
> plot(Bfox.new, pch=16)
> round(cor(Bfox.new), 4) ## round to 4 decimals
(c) 1) Compute the Variance In
ation Factors for the predictors in this model.
Do you have a concern? Brie
y explain.
> vif(lm.out)
2) Manually calculate the VIF for menwage using the R2 from regressing menwage on all
the other predictors.
> summary( lm(menwage ~ tfr + womwage + debt + parttime + time,
data=Bfox.new) )$r.squared
(d) For the above multiple regression model (lm.out), use AIC in a backward stepwise method
to select an \optimal" model. (No need to include the AIC iterations.)
2
Provide a list of the predictors that were removed in the order from rst to last.
Provide a list of predictors that remain in the nal chosen model.
> model.selection = step(lm.out)
(e) Use BIC in the model selection procedure to choose an \optimal" model. (No need to
include the BIC iterations.)
Provide a list of the predictors that were removed in the order from rst to last.
Is the nal model preferred by BIC the same as the one chosen by AIC?
> model.selection.BIC = step(lm.out, k=log(nrow(Bfox.new)))
(f) For the nal model chosen by AIC/BIC, do the following:
1) Fit the nal model. Provide summary output.
Are the partial regression coecients signi cant?
2) Compare AIC values between the original model (lm.out) and the nal model.
Which model is better according to AIC?
## check the AIC value of the original model
> AIC(lm.out)
3) Compare BIC values between the original model (lm.out) and the nal model.
Which model is better according to BIC?
## check the BIC of the original model
> BIC(lm.out)
## alternatively
> AIC(lm.out, k=log(nrow(Bfox.new)))
4) Compare Adjusted R2 between the original model (lm.out) and the nal model.
Which model is better according to Adjusted R2?
5) Conduct a partial F test to see if the nal model is sucient in comparison to the original
model (lm.out)? Hint: anova().
6) For this nal model, is there still a concern of multicollinearity? Compute VIF's and
explain.
7) For this nal chosen model, plot Cook's D against observation indices.
Plot the \bubble" plot of studentized residuals against hat values with bubble size
increasing with Cook's D. (Note: Use correct sample size and number of predictors
when determining the appropriate thresholds in your R codes.)
Which observation is unusually most in
uential | Is it most in
uential due to high
leverage and/or large residual?
(g) Compare the following two models:
> model.1 = lm(partic ~ tfr + debt + parttime)
> model.2 = lm(partic ~ menwage + debt + time)
Which model is preferred by AIC? Which model is preferred by BIC?
(h) 1) Regress partic on debt, parttime, and their interaction debt:parttime. Compute the VIF's
and explain if there is a concern of multicollinearity.
3
> fit1 = lm(partic ~ debt + parttime + debt:parttime)
2) Regress partic on centered debt, centered parttime, and the interaction of the two centered
predictors. Compute the VIF's and explain if there is a concern of multicollinearity.
## center debt
> c.debt = debt - mean(debt)
3. In the last assignment, we considered the data \Chirot" in the car library and t the model
> myfit = lm(intensity ~ commerce + tradition + commerce:tradition
+ midpeasant + inequality, data=Chirot)
(a) Compute the VIF's for the predictors in this model.
Do you have a concern? Brie
y explain.
(b) Use AIC to select an \optimal" model. (No need to include the AIC iterations.)
Provide a list of predictors that remain in the AIC-chosen model.
Note: When there are higher-order terms in a model, in the iterations, AIC or BIC
check/report the highest-order term of each predictor rst. For example, midpeasant,
inequality, commerce:tradition (but not commerce or tradition). The reason is if the inter-
action `commerce:tradition' needs to be kept in the model, then the corresponding main
e ects `commerce' and `tradition' should automatically remain in the model to ensure model
hierarchy.
(c) Use BIC to select an \optimal" model. (No need to include the BIC iterations.)
Provide a list of predictors that remain in the BIC-chosen model.
Which model is smaller, the one chosen by AIC or the one chosen by BIC?
(d) Provide the Adjusted R2 of the original model,of the AIC-chosen model, and of the BIC-
chosen model, respectively.
Which model is better according to Adjusted R2?
(e) For the AIC-chosen model, do the following:
1) Check the VIF's.
Which predictors have high VIF's?
Do you still have a concern? Brie
y explain.
2) Fit the model when both commerce and tradition are centered.
Compute the VIF's and explain if there is still a concern of multicollinearity.
Note: Provide necessary R code(s), output(s) and/or plot(s) to support your answers to each problem.
Please write up your answers clearly with all the key and necessary steps shown; you may lose points
for skipping/missing key codes, output, or steps. The homework you turn in should be neat and
stapled with your name on the top of each page. You can discuss and work on homework together,
but must do your own programming and write up your own answers. NO blind copying allowed.
The grader may deduct point if you didn't STAPLE your homework.
 

联系我们
  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-21:00
  • 微信:codinghelp
热点标签

联系我们 - QQ: 99515681 微信:codinghelp
程序辅导网!