STA108A winter 2018 Homework 6
Due at the beginning of class on Wednesday February 21, 2018
IMPORTANT INSTRUCTIONS
All homework pages (except the top sheet) must be stapled (before you come to class).
The first page (‘top sheet’) should contain ONLY your name, student ID, discussion section,
and homework number. Use the format shown below. Do NOT staple it to the rest of your
homework.
The second page (which means a new sheet of paper, not the back side of the first page)
should ALSO contain your name, student ID, discussion section, and homework number.
Use the format shown below.
After I collect the homework, I put all of the top sheets into a folder before passing the
homework to the grader. If at any point during the quarter there is a homework that you know
you turned in, but it does not show on Canvas, contact me. I will look to see if I have your
top sheet from the homework. If I have your top sheet, then I will give you credit for the
homework.
It is your responsibility to make sure every homework assignment you submit has a top sheet
with your correct discussion section number. If you tell me you turned in a homework, but
there is no Canvas grade and no top sheet in my folder then you will get a 0.
You will not lose any points for not making a top sheet. But if your homework goes missing,
you will have no way to prove you turned it in.
On both the first page (‘top sheet’) and second page write your name and student ID on the
top left, homework number on the top center, and section on the top right. For example,
for homework 6 if your name is John Smith, your student ID is 123456789, and you are in
section A01, then the top of your top sheet and first stapled page should look like this:
John Smith                    Homework 6                    A01
123456789
Rule                                                          Points lost if you don’t follow the rule
Correct format for name, ID, homework and section number      1
Staple all pages EXCEPT the top sheet                         1
If your homework is on paper pulled out of a notebook,
cut off all of the fringes (from the torn horizontal threads
that attached the paper to the notebook)                      1
Please do not turn in your R code with your homework.
Be kind to the grader:
- Make sure you write your name clearly (so it is easy to read).
- Write neatly.
We will use the same data from homework 5. (The next part is exactly what is in homework 5,
but here I add a third and fourth model.)
The data are from a hypothetical experimental study similar to homework 3 which examines
the relationship between 5 doses of a cholesterol lowering drug and reduction in serum
cholesterol, but with an additional categorical predictor variable, exercise. There are three
different types of exercise: walk, bike, and jog.
predictor variables
1. drug dose: (50, 55, 60, 65, and 70)
2. exercise: (1=walk, 2=bike, 3=jog)
outcome: cholesterol reduction
(There are some negative y values which means some subjects had an increase in cholesterol.)
This is from a randomized experiment with two replications for each combination of dose
and exercise. This means the predictor variables are balanced, which means that dose and
exercise are independent.
Define notation
d = dose
a = exercise (1=walk, 2=bike, 3=jog) (a for "activity", since $e_i$ is already used for the
residuals)
$\mathrm{var}(Y \mid d) = \sigma^2_{Y \mid d}$ = the variance of Y conditional on dose d
$\mathrm{var}(Y \mid a) = \sigma^2_{Y \mid a}$ = the variance of Y conditional on exercise a
$\mathrm{var}(Y \mid d, a) = \sigma^2_{Y \mid d,a}$ = the variance of Y conditional on dose d and exercise a
model 1: predictor: dose (no exercise effect)
$$Y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i \tag{1}$$
$x_{i1}$ = dose for subject i
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d})$
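model 2: dose + exercise effects with walk as the baseline group
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \tag{2}$$
$x_{i1}$ = dose for subject i
$$x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in the biking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if subject } i \text{ is in the jogging group} \\ 0 & \text{otherwise} \end{cases}$$
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d,a})$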
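model 3: dose + exercise effects with jog as the baseline group
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i \tag{3}$$
$x_{i1}$ = dose for subject i
$$x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in the walking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if subject } i \text{ is in the biking group} \\ 0 & \text{otherwise} \end{cases}$$
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d,a})$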
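model 4: dose + exercise effects with means parameterization
$$Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i \tag{4}$$
$x_{i1}$ = dose for subject i
$$x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in the walking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if subject } i \text{ is in the biking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i4} = \begin{cases} 1 & \text{if subject } i \text{ is in the jogging group} \\ 0 & \text{otherwise} \end{cases}$$
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d,a})$
The interpretations of the parameters in model 4 are
$\beta_1$ = dose slope
$\beta_2 = E(Y \mid \text{walk and dose} = 0)$
$\beta_3 = E(Y \mid \text{bike and dose} = 0)$
$\beta_4 = E(Y \mid \text{jog and dose} = 0)$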
You can run model 4 in R using the following commands
a1=as.numeric(exercise==1)
a2=as.numeric(exercise==2)
a3=as.numeric(exercise==3)
summary(lm(y ~ -1+dose+a1+a2+a3))
The -1 in the command takes the ‘intercept’ out of the model.
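To see why the -1 is needed, here is a minimal sketch on simulated data. The data frame `sim`, the coefficients used to generate it, and the random seed are made up for illustration and are not the homework data. If the intercept is kept alongside all three indicators, the columns are linearly dependent (a1 + a2 + a3 = 1 for every subject), so R cannot estimate all five coefficients and reports NA for one of them.

```r
# Hypothetical simulated data with the same layout as the homework
# (5 doses x 3 exercise groups x 2 replicates); NOT the real dataset.
set.seed(108)
sim <- expand.grid(dose = c(50, 55, 60, 65, 70), exercise = 1:3, rep = 1:2)
sim$y <- -60 + 1.2 * sim$dose + c(0, 30, 35)[sim$exercise] +
  rnorm(nrow(sim), sd = 20)

# Indicator variables, built the same way as in the homework
sim$a1 <- as.numeric(sim$exercise == 1)
sim$a2 <- as.numeric(sim$exercise == 2)
sim$a3 <- as.numeric(sim$exercise == 3)

# Intercept plus all three indicators: a1 + a2 + a3 equals the intercept
# column, so one coefficient cannot be estimated and R sets it to NA.
fit_overparam <- lm(y ~ dose + a1 + a2 + a3, data = sim)
print(coef(fit_overparam))

# Removing the intercept with -1 leaves four estimable parameters.
fit_means <- lm(y ~ -1 + dose + a1 + a2 + a3, data = sim)
print(coef(fit_means))
```

Both fits give the same fitted values; only the second one has a separate, estimable parameter for each exercise group.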
Models 2, 3, and 4 are different parameterizations of the same model. We refer to two
different models as being the same model if they have the same expected values for every
combination of the predictor variables, which also means they have the same predicted values
for every observation in the dataset.
In homework 5 you estimated the parameters in model 2 which uses walking as the
baseline group. The results were
> summary(lm(y~dose+a2+a3))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -67.8000    33.4009  -2.030  0.05272 .
dose          1.2033     0.5454   2.206  0.03641 *
a2           29.6000     9.4472   3.133  0.00425 **
a3           34.3000     9.4472   3.631  0.00122 **
---
Residual standard error: 21.12 on 26 degrees of freedom
Multiple R-squared: 0.4392, Adjusted R-squared: 0.3745
F-statistic: 6.788 on 3 and 26 DF, p-value: 0.00157
Notes
1. The "Residual standard error" is the square root of the MSE.
2. "F-statistic: 6.788 on 3 and 26 DF, p-value: 0.00157" is a model comparison F
test where the full model is model 2 and the reduced model is a null model with no
predictor variables. So the reduced model is
$$Y_i = \beta_0 + \varepsilon_i$$
$\varepsilon_i \sim N(0, \sigma^2_Y)$
Note that $\mathrm{var}(\varepsilon_i) = \mathrm{var}(Y)$, which is a marginal variance (meaning it is not a conditional
variance because it is not conditional on anything).
I usually call this the null model because it is the most reduced model possible.
The F = 6.788 is simultaneously testing if dose and/or exercise are significantly related
to cholesterol reduction. And if we had additional predictor variables in the model,
this would be testing if any of the predictors were significantly related to cholesterol
reduction. Once we have more than a few predictors in a regression model it will
usually be the case that at least some of the predictors are significant. And we usually
want to examine each predictor variable separately. So the F statistic you get in the
printout from R is a model comparison that is often not of interest. However, if there
is only one predictor variable in the model then this F statistic is testing if the one
predictor variable is a significant predictor of cholesterol.
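Both notes can be checked numerically. The sketch below uses simulated data (the `sim` data frame, its generating coefficients, and the seed are assumptions for illustration, not the homework data). It reproduces the F statistic from the printout as an explicit comparison of the null model against the full model with anova(), and recomputes the residual standard error as the square root of the MSE.

```r
# Hypothetical simulated data with the same layout as the homework
# (5 doses x 3 exercise groups x 2 replicates); NOT the real dataset.
set.seed(108)
sim <- expand.grid(dose = c(50, 55, 60, 65, 70), exercise = 1:3, rep = 1:2)
sim$y <- -60 + 1.2 * sim$dose + c(0, 30, 35)[sim$exercise] +
  rnorm(nrow(sim), sd = 20)
sim$a2 <- as.numeric(sim$exercise == 2)
sim$a3 <- as.numeric(sim$exercise == 3)

null_fit <- lm(y ~ 1, data = sim)               # reduced (null) model
full_fit <- lm(y ~ dose + a2 + a3, data = sim)  # full model (model 2)

# Note 2: the F statistic in summary() is the null-vs-full comparison.
cmp <- anova(null_fit, full_fit)
F_compare <- cmp$F[2]
F_summary <- unname(summary(full_fit)$fstatistic["value"])
print(c(F_compare = F_compare, F_summary = F_summary))

# Note 1: residual standard error = sqrt(MSE) = sqrt(SSE / residual df).
mse <- sum(resid(full_fit)^2) / df.residual(full_fit)
print(c(sqrt_mse = sqrt(mse), rse = summary(full_fit)$sigma))
```

The same anova(reduced, full) pattern applies to any pair of nested models, not only to the null model.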
Beginning of questions
1. Create three indicator variables (same as you did in homework 5) using the following
R code.
a1=as.numeric(exercise==1)
a2=as.numeric(exercise==2)
a3=as.numeric(exercise==3)
No answer is required for this question.
2. What is the interpretation of the following parameters in model 3? Give the interpretation
both in words (e.g., it is the mean or intercept for walking, or the difference
between walking and biking, etc.) and in terms of the conditional expected values,
that is, some function of $E(Y \mid d = 50, a = 1), \ldots, E(Y \mid d = 70, a = 3)$.
(a) $\beta_0$
(b) $\beta_1$
(c) $\beta_2$
(d) $\beta_3$
3. Use the values of the least squares estimates of the parameters in model 2 (given above
on page 4) to find the values of the least squares estimates of the parameters in model
3. You can verify your answers are correct if you like by using R to get the estimates
of the parameters for model 3.
4. Models 2, 3, and 4 are three different parameterizations of the same model. You can
see that models 2, 3, and 4 are the same by checking that they give the same predicted
(also called "fitted") values.
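m1=lm(y~dose+a2+a3)        # model 2: walk is the baseline group
m2=lm(y~dose+a1+a2)        # model 3: jog is the baseline group
m3=lm(y~dose-1+a1+a2+a3)   # model 4: means parameterization
cbind(fitted(m1),fitted(m2),fitted(m3))   # the three columns should be identical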
No answer is required for this question.
5. Using your parameter estimates from either model 2 or model 3, provide
estimates of the following.
(a) $E(Y \mid \text{walk}, \text{dose} = 0)$
(b) $E(Y \mid \text{bike}, \text{dose} = 0)$
(c) $E(Y \mid \text{jog}, \text{dose} = 0)$
6. Check your answers to the previous question by running model 4 (with means
parameterization).
summary(lm(y~dose-1+a1+a2+a3))
No answer is required for this question.
7. Using your parameter estimates from either model 2 or model 3, provide
estimates of the following.
(a) $E(Y \mid \text{walk}, \text{dose} = 55)$
(b) $E(Y \mid \text{bike}, \text{dose} = 55)$
(c) $E(Y \mid \text{jog}, \text{dose} = 55)$
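8. You can check your answers to the previous question by using a model parameterization
where the parameters in the model are equal to the expected values at dose 55.
$$Y_i = \beta_1 (x_{i1} - 55) + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i$$
$x_{i1}$ = dose for subject i
$$x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in the walking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if subject } i \text{ is in the biking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i4} = \begin{cases} 1 & \text{if subject } i \text{ is in the jogging group} \\ 0 & \text{otherwise} \end{cases}$$
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d,a})$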
Run using R code
doseminus55=dose-55
summary(lm(y~doseminus55-1+a1+a2+a3))
By replacing $x_{i1}$ in model 4 with $(x_{i1} - 55)$, the parameters $\beta_2$, $\beta_3$, and $\beta_4$ become the expected
values at dose 55. What we are doing is replacing the variable dose with a new variable
called doseminus55. Then $\beta_2$, $\beta_3$, and $\beta_4$ are the expected values when doseminus55=0,
which corresponds to dose=55.
No answer is required for this question.
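The effect of centering can be illustrated on simulated data. In this sketch the `sim` data frame, its generating coefficients, and the seed are assumptions for illustration and are not the homework data. The indicator coefficients from the centered means parameterization are compared with predictions at dose 55 from the walk-baseline parameterization.

```r
# Hypothetical simulated data with the same layout as the homework
# (5 doses x 3 exercise groups x 2 replicates); NOT the real dataset.
set.seed(108)
sim <- expand.grid(dose = c(50, 55, 60, 65, 70), exercise = 1:3, rep = 1:2)
sim$y <- -60 + 1.2 * sim$dose + c(0, 30, 35)[sim$exercise] +
  rnorm(nrow(sim), sd = 20)
sim$a1 <- as.numeric(sim$exercise == 1)
sim$a2 <- as.numeric(sim$exercise == 2)
sim$a3 <- as.numeric(sim$exercise == 3)
sim$doseminus55 <- sim$dose - 55

# Means parameterization with dose centered at 55
fit_centered <- lm(y ~ doseminus55 - 1 + a1 + a2 + a3, data = sim)

# Walk-baseline parameterization, used to predict at dose = 55
fit_walk <- lm(y ~ dose + a2 + a3, data = sim)
at55 <- data.frame(dose = 55, a2 = c(0, 1, 0), a3 = c(0, 0, 1))
pred55 <- unname(predict(fit_walk, newdata = at55))   # walk, bike, jog

# The two columns should agree row by row.
print(cbind(centered_coef = unname(coef(fit_centered)[c("a1", "a2", "a3")]),
            predicted_at_55 = pred55))
```

The dose slope is unchanged by the centering; only the meaning of the three group parameters shifts from dose 0 to dose 55.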
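9. Suppose we reparameterize model 4, replacing the dose variable with dose − c, where c
is some constant. So the model is
$$Y_i = \beta_1 (x_{i1} - c) + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i$$
$x_{i1}$ = dose for subject i
$$x_{i2} = \begin{cases} 1 & \text{if subject } i \text{ is in the walking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i3} = \begin{cases} 1 & \text{if subject } i \text{ is in the biking group} \\ 0 & \text{otherwise} \end{cases}$$
$$x_{i4} = \begin{cases} 1 & \text{if subject } i \text{ is in the jogging group} \\ 0 & \text{otherwise} \end{cases}$$
$\varepsilon_i \sim N(0, \sigma^2_{Y \mid d,a})$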
What value of c will result in the smallest standard errors of the estimates of $\beta_2$, $\beta_3$,
and $\beta_4$? Hints: You can answer this question by thinking about what expected values
are being estimated by $b_2$, $b_3$, and $b_4$ and how that depends on the value c. Recall, when
we had simple linear regression with only one continuous x variable, how the variance
of the predicted value (the estimate of the expected value) depended on the value of
x.
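For reference, the simple linear regression result mentioned in the hint is the standard formula for the variance of the estimated mean response at a value $x_h$. The symbols $x_h$, $\bar{x}$, and $n$ are notation introduced here, with $\sigma^2$ the error variance:

```latex
\mathrm{var}(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(x_h - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]
```

The second term inside the brackets is the only part that depends on $x_h$.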
10. Will you get the same or different values for the MSE for models 2, 3, and 4? You can
answer this question by actually calculating the MSE for each of the models. However,
since you might see this type of question on an exam, I suggest you first try to answer
the question without doing any calculations.
11. Conduct a hypothesis test using an F statistic to compare models 1 and 2. This is
equivalent to testing if there is an exercise effect. The null and alternative hypotheses
can be written in several different ways, including
$H_0: E(Y \mid d, a = 1) = E(Y \mid d, a = 2) = E(Y \mid d, a = 3)$
$H_a$: $H_0$ not true
Note that it is not necessary to specify a value for d because we are assuming there is
no interaction, which means that the differences between $E(Y \mid d, a = 1)$, $E(Y \mid d, a = 2)$,
and $E(Y \mid d, a = 3)$ do not depend on the value of d.
(a) Write the null hypothesis $H_0$ as a statement about the parameters $\beta_0, \beta_1, \beta_2, \beta_3$
in model 2.
(b) What is the value of the test statistic F ?
(c) What is the p-value?
(d) Does the data provide evidence for Ha? How did you make your decision?
12. Suppose we conduct a model comparison F test to compare models 1 and 4. This is
the same null hypothesis as in question 11. Write the null hypothesis as a statement
about the parameters $\beta_1, \beta_2, \beta_3, \beta_4$ in model 4.
13. Suppose we want to test the null hypothesis of no difference between biking and jogging.
$H_0: E(Y \mid d, a = 2) = E(Y \mid d, a = 3)$
$H_a: E(Y \mid d, a = 2) \neq E(Y \mid d, a = 3)$
Note $H_0$ and $H_a$ can also be written as
$H_0: E(Y \mid d, a = 2) - E(Y \mid d, a = 3) = 0$
$H_a: E(Y \mid d, a = 2) - E(Y \mid d, a = 3) \neq 0$
Again, we don’t specify a value for d in $H_0$ and $H_a$ because we are assuming the no
interaction model is correct, which means $E(Y \mid d, a = 2) - E(Y \mid d, a = 3)$ does not
depend on the value of d.
(a) Use the estimated regression function from either model 2, 3, or 4 to estimate the
difference between biking and jogging. Specifically, give an estimate of
$E(Y \mid d, a = 2) - E(Y \mid d, a = 3)$. Note that because there is no interaction in the
model, this estimate does not depend on the value of d. Therefore, you can plug
in any value of d (using the same value for estimating both expected values).
Because the data is balanced, the estimate you get from the model is exactly the
same as if you simply subtract the mean for jogging from the mean for biking. R
code for this:
mean(y[exercise==2])-mean(y[exercise==3])
(b) Using the parameters from model 2 where the baseline group is walking, write the
null and alternative hypotheses as statements about the parameters ($\beta_0, \beta_1, \beta_2, \beta_3$).
(c) Using the parameters from model 3 where the baseline group is jogging, write the
null and alternative hypotheses as statements about the parameters ($\beta_0, \beta_1, \beta_2, \beta_3$).
(d) Using the parameters from model 4 (means parameterization), write the null and
alternative hypotheses as statements about the parameters ($\beta_1, \beta_2, \beta_3, \beta_4$).
(e) Give the reduced model for the null hypothesis. Note that there are several
different ways to parameterize the model.
(f) Conduct a model comparison F test for this null hypothesis. Give the value of
the test statistic F, the p-value, and your conclusion if the significance level is
$\alpha = 0.05$.
(g) Reparameterize so that there is a single parameter that is equal to
$E(Y \mid d, a = 2) - E(Y \mid d, a = 3)$.
i. Write out this model, making sure to clearly define all of your indicator
variables.
ii. Run a regression using this model and compare the t and p-value for the
parameter in your model that equals $E(Y \mid d, a = 2) - E(Y \mid d, a = 3)$ with the
F and p-value from the model comparison test. Check that you get the same
p-value you gave in part (f). Now wasn’t that easier to just reparameterize
the model and get the estimate you wanted and the p-value straight from the R
output?
No answer is required for this question.
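The agreement between the t test and the model comparison F test holds for any hypothesis about a single parameter. Here is a minimal sketch on simulated data, deliberately using a different contrast (walk versus bike in model 2) from the one in this question. The `sim` data frame, its generating coefficients, and the seed are assumptions for illustration and are not the homework data.

```r
# Hypothetical simulated data with the same layout as the homework
# (5 doses x 3 exercise groups x 2 replicates); NOT the real dataset.
set.seed(108)
sim <- expand.grid(dose = c(50, 55, 60, 65, 70), exercise = 1:3, rep = 1:2)
sim$y <- -60 + 1.2 * sim$dose + c(0, 30, 35)[sim$exercise] +
  rnorm(nrow(sim), sd = 20)
sim$a2 <- as.numeric(sim$exercise == 2)
sim$a3 <- as.numeric(sim$exercise == 3)

# Full model: walk is the baseline, so the a2 coefficient is bike minus walk.
full_fit <- lm(y ~ dose + a2 + a3, data = sim)

# Reduced model under H0 (a2 coefficient = 0): walk and bike share one mean.
reduced_fit <- lm(y ~ dose + a3, data = sim)

cmp  <- anova(reduced_fit, full_fit)
t_a2 <- summary(full_fit)$coefficients["a2", "t value"]
p_a2 <- summary(full_fit)$coefficients["a2", "Pr(>|t|)"]

# For a one-parameter hypothesis, t^2 equals F and the p-values agree.
print(c(t_squared = t_a2^2, F = cmp$F[2]))
print(c(p_from_t = p_a2, p_from_F = cmp$`Pr(>F)`[2]))
```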
14. Suppose we took dose out of the model.
$$Y_i = \beta_3 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i \tag{5}$$
$x_{i1}$ = I(subject i walks)
$x_{i2}$ = I(subject i bikes)
where I(statement) is an indicator variable, i.e., it is 1 if ‘statement’ is true and 0
otherwise.
Note that the answers to questions (a), (b), and (c) depend on whether or not the
data is balanced, and for this dataset it is. You can verify the data is balanced
by calculating the covariance between dose and exercise to check that it is 0. As
a consequence, the covariance between dose and each of the three indicator variables is
also 0.
cov(dose,exercise)
cov(dose,a1)
cov(dose,a2)
cov(dose,a3)
(a) If you use the model given by equation (5) to estimate $E(Y \mid a = 2) - E(Y \mid a = 3)$,
would you get the same value as you got in question 13(a)?
(b) If you use the model given by equation (5) to calculate a p-value to test the null
hypothesis of no exercise effect, would you get the same p-value as you did in
question 13? If not, would the p-value be larger or smaller?
(c) If you are only interested in testing for an exercise effect (and do not care about
the dose effect), should you use a model with or a model without dose? Why?