

Question 1 (5 points)
Consider the following underlying linear regression model
yi = β1 Xi1 + β2 Xi2 + εi,
with the standard assumptions that E(εi) = 0, Var(εi) = 1, and Cov(εi, εj) = 0
for i ≠ j. Note that we do not include an intercept in this question.
Suppose that you observe.
Answer the following questions using math, i.e., by hand:
(a) (1 point) Suppose that we fit the model
yi = β1 X1i + β2 X2i + εi.
Write down the design matrix X with the given X1 and X2.
(b) (2 points) Fit the model yi = β1 X1i + εi, and let β̂1 be the least squares
estimator for β1.
• Derive the mean and variance of β̂1.
• Is β̂1 an unbiased estimator for β1?
(c) (2 points) Suppose now that the true underlying model is
yi = β1 X1i + β2 X3i + εi,
but that you fit the model
yi = β1 X1i + εi
to estimate β1.
• Derive the mean and variance of the least squares estimator for β1.
• Is it an unbiased estimator?
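A sketch of the least squares machinery used in parts (b) and (c) (general facts, not the full answers; here Var(yi) = Var(εi) = 1 and the errors are uncorrelated):

```latex
% Fitting y_i = beta_1 X_{1i} + eps_i by least squares:
\hat\beta_1 = \frac{\sum_i X_{1i}\, y_i}{\sum_i X_{1i}^2},
\qquad
E(\hat\beta_1) = \frac{\sum_i X_{1i}\, E(y_i)}{\sum_i X_{1i}^2},
\qquad
\mathrm{Var}(\hat\beta_1)
  = \frac{\sum_i X_{1i}^2\, \mathrm{Var}(y_i)}{\bigl(\sum_i X_{1i}^2\bigr)^2}
  = \frac{1}{\sum_i X_{1i}^2}.
```

The mean depends on E(yi) under whichever model is actually true, and that is what decides unbiasedness in parts (b) and (c).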
Question 2 (Total 8 points)
Download the data Q2.csv from Canvas. This is a simulated dataset motivated
by an example in the book Machine Learning with R by Brett Lantz.
The response variable in the dataset is
• charges = medical cost (in dollars) billed by the insurance company
The 6 covariates are
• gender = Gender of the primary beneficiary, ‘f’ if female, ‘m’ if male.
• age = Age of the primary beneficiary.
• bmi = Body mass index
• smoker = Smoker or not, ‘yes’ for smoker, ‘no’ for non-smoker
• children = Number of dependents, treated as a continuous/numeric covariate
in this problem.
• region = Residential area in the US.
(a) Build a linear model by regressing charges on all 6 covariates. Answer
the following questions.
(i) Which effects are significant at α = 0.05, and what is the direction of
the effects? Is there a relationship between age and charges?
(ii) Find a 95% confidence interval for the linear coefficient for bmi.
(iii) What are the R2 and adjusted R2?
(b) Build a reduced model by regressing charges on age, bmi and smoker.
Compare this model with the full model fitted in part (a) using an F
test. According to the F test, does the model in part (a) fit the data
significantly better than that in part (b)?
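The steps in Question 2 can be sketched in R as follows. The column names match the description above; the synthetic data frame is a hypothetical stand-in so the snippet runs without the real file (in the assignment you would use read.csv("Q2.csv")).

```r
# Synthetic stand-in for Q2.csv with the same columns as described above.
set.seed(1)
n <- 100
q2 <- data.frame(
  gender   = factor(sample(c("f", "m"), n, replace = TRUE)),
  age      = sample(18:64, n, replace = TRUE),
  bmi      = rnorm(n, mean = 30, sd = 5),
  smoker   = factor(sample(c("yes", "no"), n, replace = TRUE)),
  children = sample(0:4, n, replace = TRUE),
  region   = factor(sample(c("midwest", "northeast", "south", "west"),
                           n, replace = TRUE))
)
q2$charges <- 2000 + 250 * q2$age + 300 * q2$bmi +
  20000 * (q2$smoker == "yes") + rnorm(n, sd = 3000)

# (a) Full model with all 6 covariates.
full <- lm(charges ~ gender + age + bmi + smoker + children + region, data = q2)
summary(full)                       # (i) significance/direction; (iii) R2, adj. R2
confint(full, "bmi", level = 0.95)  # (ii) 95% CI for the bmi coefficient

# (b) Reduced model, compared with the full model by a nested F test.
reduced <- lm(charges ~ age + bmi + smoker, data = q2)
anova(reduced, full)                # F test: does the full model fit better?
```

The F test in anova() is valid here because the reduced model's covariates are a subset of the full model's.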
Question 3 (Total 12 points)
In class, we mentioned that there are many variable selection methods available.
In this example, we study additional performance metrics and use simulation
to verify their effectiveness. We will also study and try stepwise variable
selection for multiple linear regression.
(a) We have learned the adjusted R2 as a metric for model fit. In this
question, we compare different models using the adjusted R2.
Suppose the predictor is generated by
set.seed(2020)
n=200
x=rnorm(n)
Remark: To make our results comparable, please use set.seed(2020)
when generating x.
(i) Suppose the underlying model is generated by
eps=rnorm(n)
y=x+x^2+x^3+eps
What is the underlying model? How many covariates are there in
the underlying model? Please specify the covariates and the true linear
coefficients.
(ii) Fit 6 different models: yi = β0 + ∑_{j=1}^p βj Xji + εi for p = 1, 2, 3, 4, 5, 6.
These models are polynomials of different orders. Calculate the adjusted
R2 for each of them, and draw a plot showing the adjusted
R2 (x-axis: p; y-axis: adjusted R2). Does the correct model have the
largest adjusted R2?
(Hint: You can first create a data matrix
X = cbind(x,x^2,x^3,x^4,x^5,x^6)
and then use a for loop to run the 6 regression models, in order to
simplify the code. Also, try summary(model)$adj.r.squared to extract
the adjusted R2 of a fitted model.)
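The hint above can be turned into a short R sketch, reusing the seed and data-generating code from part (i):

```r
# Generate the data exactly as specified in part (i).
set.seed(2020)
n <- 200
x <- rnorm(n)
eps <- rnorm(n)
y <- x + x^2 + x^3 + eps

# Fit polynomial models of order p = 1..6 and record each adjusted R2.
X <- cbind(x, x^2, x^3, x^4, x^5, x^6)
adj.r2 <- numeric(6)
for (p in 1:6) {
  model <- lm(y ~ X[, 1:p, drop = FALSE])
  adj.r2[p] <- summary(model)$adj.r.squared
}
plot(1:6, adj.r2, type = "b", xlab = "p", ylab = "adjusted R2")
which.max(adj.r2)   # order with the largest adjusted R2
```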
(iii) Instead of using the adjusted R2, there are other performance criteria.
Here, we consider the Akaike Information Criterion (AIC) and
the Bayesian Information Criterion (BIC). Read the document at the link
https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
Alternatively, you can read pages 385-386 of the textbook on ‘selecting
predictor variables from a large set’ and page 705 for the definition
of AIC and BIC.
For this question, write down the definition of AIC and BIC in terms
of Residual Sum of Squares (RSS), n and p.
(iv) In R, AIC and BIC can be computed using the functions AIC and BIC,
respectively. Repeat part (ii) with the adjusted R2 replaced by the AIC
and BIC defined in part (iii), and plot the results. Does the correct
model have the smallest AIC and BIC?
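A minimal sketch for this part, reusing the setup from parts (i)-(ii) with AIC and BIC in place of the adjusted R2:

```r
# Same data-generating code as in part (i).
set.seed(2020)
n <- 200
x <- rnorm(n)
eps <- rnorm(n)
y <- x + x^2 + x^3 + eps
X <- cbind(x, x^2, x^3, x^4, x^5, x^6)

# Score the 6 polynomial fits by AIC and BIC (smaller is better).
aic <- bic <- numeric(6)
for (p in 1:6) {
  model <- lm(y ~ X[, 1:p, drop = FALSE])
  aic[p] <- AIC(model)
  bic[p] <- BIC(model)
}
matplot(1:6, cbind(aic, bic), type = "b", pch = 1:2,
        xlab = "p", ylab = "criterion value")
legend("topright", legend = c("AIC", "BIC"), pch = 1:2)
c(which.min(aic), which.min(bic))   # orders with the smallest AIC / BIC
```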
(v) Repeat the simulation in (i), (ii) and (iv) 100 times. You will need
to keep the same x while generating new random eps each time.
Each time, use the adjusted R2 (the largest one) and AIC and BIC (the
smallest one) to select a model. Record the model selected in each
simulation (i.e., record the selected p).
Report the frequency with which the adjusted R2, AIC and BIC correctly
select the true model among the 100 simulations. For this problem, which
metric selects the model best? Do the other metrics tend to
select more covariates or fewer covariates than the correct one?
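The replication loop can be sketched as follows: x is generated once, a fresh eps is drawn each round, and the true order is p = 3.

```r
# Fixed predictor, generated once with the required seed.
set.seed(2020)
n <- 200
x <- rnorm(n)
X <- cbind(x, x^2, x^3, x^4, x^5, x^6)

# One row per replication; one column per selection criterion.
picks <- matrix(NA, nrow = 100, ncol = 3,
                dimnames = list(NULL, c("adjR2", "AIC", "BIC")))
for (s in 1:100) {
  y <- x + x^2 + x^3 + rnorm(n)   # fresh eps each replication
  adj.r2 <- aic <- bic <- numeric(6)
  for (p in 1:6) {
    model <- lm(y ~ X[, 1:p, drop = FALSE])
    adj.r2[p] <- summary(model)$adj.r.squared
    aic[p] <- AIC(model)
    bic[p] <- BIC(model)
  }
  picks[s, ] <- c(which.max(adj.r2), which.min(aic), which.min(bic))
}
colMeans(picks == 3)   # frequency each criterion selects the true order p = 3
```

Comparing the three columns of picks to 3 also shows whether a criterion's mistakes tend toward larger or smaller p.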
(b) For multiple linear regression, stepwise selection (including forward search,
backward search, and search in both directions) is usually used for model selection.
Read Section 16.2 of the document
https://daviddalpiaz.github.io/appliedstats/variable-selection-and-model-building.html
and answer the following questions.
(i) Use about two or three sentences to describe stepwise variable selection
methods.
(ii) Download the data Q3.csv and regress Y on X1 - X20. (Hint: try
lm(Y~.,data=Q3)). Use the function step to select variables (use the
default arguments without changing its arguments like k or direction).
Report the selected variables.
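A sketch for (b)(ii); the synthetic Q3 data frame below is a hypothetical stand-in with columns Y and X1-X20, so the snippet runs without the real file (in the assignment you would use Q3 <- read.csv("Q3.csv")).

```r
# Hypothetical stand-in for Q3.csv: 20 predictors, of which only X1 and X2
# actually enter the response here.
set.seed(2020)
n <- 100
Q3 <- as.data.frame(matrix(rnorm(n * 20), n, 20,
                           dimnames = list(NULL, paste0("X", 1:20))))
Q3$Y <- 2 * Q3$X1 - Q3$X2 + rnorm(n)

full <- lm(Y ~ ., data = Q3)   # regress Y on X1 - X20
sel <- step(full, trace = 0)   # defaults: backward search scored by AIC (k = 2)
names(coef(sel))               # report the selected variables
```

trace = 0 only suppresses the per-step printout; the selection itself still uses step's default arguments.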
