调试R、Statistics设计辅导留学生、辅导R编程、R语言讲解留学生、解析R编程

Statistics 108, Project 2, Due: December 4th by 5pm
Turn in the report in electronic form. (word or pdf) through Canvas
Instruction: This project is to analyze a dataset, from start to nish, based on the multiple linear
regression model. It is an individual project. Students could discuss with each other to get better
understanding of the project. Copying solutions or computing codes from other students or other
sources is plagiarism. At a minimum, all students involved will receive a 0 on this project for any
type of academic dishonesty.
R codes: Attach the entire R codes you used to analyze the data at the end of the report.
Data description: The data \diabetes.txt" contains 16 variables on 366 subjects who were in-
terviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular
risk factors in central Virginia for African Americans. We will consider building regression models
with glyhb as the response variable as Glycosolated Hemoglobin > 70 is often taken as a positive
diagnostics of diabetes. The goal is to nd the \best" model for later use.
Data exploration and split data for validation later on.
1. Among all the variable, which of the variables are quantitative variables? Which are qualita-
tive variables? Draw histogram for each quantitative variable and comment on its distribution.
Draw pie chart for each qualitative variable and comment on how its classes are distributed.
Draw scatterplot matrix and obtain the pairwise correlation matrix for all quantitative vari-
ables in the data. Comment on their relationships.
2. Regress glybh on all predictor variables (Model 1). Draw the diagnostic plots of the model
and comment.
3. You want to check whether any transformation on the response variable is needed. You use
the function ‘boxcox’ to help you make the decision. State the transformation you decide to
use. In the following, we denote the transformed response variable to be glyhb . Regress
glyhb on all predictor variables (Model 2). Draw the diagnostic plots of this model and
comment. Apply boxcox again on Model 2; what do you nd?
4. Randomly split data into two equal halves: a training data set and a validation data set.
Selection of rst-order e ects. We now consider subsets selection from the pool of all rst-order
e ects of the 15 predictors. glyhb* is used as the response variable for the following problems.
5. Fit a model with all rst-order e ects (Model 3). How many regression coe cients are there
in this model? What is the MSE from this model?
6. Consider best subsets selection using the R function regsubsets() from the leaps library
with Model 3 as the full model. Return the top 1 best subset of all subset sizes (i.e., number
of X variables) up to 16 (because frame. has 3 levels). Get SSEp,R2p, R2a;p, Cp, AICp,BICp for
each of these models, as well as the none-model (the model with only an intercept). Identify
the best model according to each criterion. For the best model according to Cp criterion,
what do you observe about its Cp value? Do you have a possible explanation for it?
Denote the best models according to AIC, BIC, and adjusted R2 be Model 3.1, Model 3.2,
Model 3.3, respectively. (It is possible that some of the three models are the same.)
1
Selection of rst- and second- order e ects. We now consider subsets selection from the pool
of rst-order e ects as well as 2-way interaction e ects of the 15 predictors.
7. Fit a model with all rst-order and 2-way interaction e ects (Model 4). How many regression
coe cients are there in this model? What is the MSE from this model? Do you have any
concern about the tting of this model and why?
8. Apply the forward stepwise procedure using R function step() (or stepAIC()), starting
from the none-model and using the AICp criterion. What is the model being selected? Denote
this model by Model fs1. Compare its AIC value with that of Model3.1. What do you nd?
9. Apply the forward stepwise procedure using R function step() (or stepAIC()), starting
from the full model (Model 3) and using the AICp criterion. What is the model being selected?
Denote this model by Model fs2. Compare its AIC value with that of Model fs1. What do
you nd?
10. Compare the BIC values of Model fs1 and Model fs2. What do you nd? Do AIC and BIC
choose the same model among these two models or not? Denote the model selected by AIC
among the two models by Model 4.1 and that selected by BIC be Model 4.2. (It is possible
that Model 4.1 and Model 4.2 are the same model.)
Model validation. We now consider validation of the models (Model 3.1, Model 3.2, Model 3.3,
Model 4.1, Model4.2) you selected in the previous studies.
11. Internal validation. We use PRESS for this purpose. Calculate PRESS for each of these
models. Comment.
12. External validation using the validation set. For each of these models (Model 3.1, Model
3.2, Model 3.3, Model 4.1, Model4.2), calculate the mean squared prediction error (MSPR),
i.e., you use the model to predict the 183 observations in the validation set and calculate
the averaged squared prediction error. How do these MSPRs compare with the respective
PRSSE=n (here n is the sample size of the training data, i.e., 183). Which model has the
smallest MSPR?
13. Based on both internal and external validation, which model you would choose as the nal
model? Fit the nal model using the entire data set (training and validation combined)
(Model 5). Write down the tted regression function and report the R summary() and
anova() output.