
IMPORTANT INSTRUCTIONS
All homework pages (except the top sheet) must be stapled (before you come to class).
The first page ('top sheet') should contain ONLY your name, student ID, discussion section,
and homework number. Use the format shown below. Do NOT staple it to the rest of your
homework.
The second page (which means a new sheet of paper, so not the back side of the first page)
should ALSO contain your name, student ID, discussion section, and homework number.
Use the format shown below.
After I collect homeworks I put all of the top sheets into a folder before passing the homeworks
to the grader. If at any point during the quarter there is a homework that you know
you turned in, but it does not show on Canvas, contact me. I will look to see if I have your
top sheet from the homework. If I have your top sheet then I will give you credit for the
homework.
It is your responsibility to make sure every homework assignment you submit has a top sheet
with your correct discussion section number. If you tell me you turned in a homework, but
there is no Canvas grade and no top sheet in my folder then you will get a 0.
You will not lose any points for not making a top sheet. But if your homework goes missing
you will have no way to prove you turned it in.
On both the first page ('top sheet') and second page write your name and student ID on the
top left, homework number on the top center, and section on the top right. For example,
for homework 3 if your name is John Smith, your student ID is 123456789, and you are in
section A01, then the top of your first page should look like this
John Smith Homework 3 A01
123456789
Points lost if you don't follow the rule:
- correct format for name, ID, homework and section number (1 point)
- staple all pages EXCEPT the top sheet (1 point)
- if your homework is on paper pulled out of a notebook, cut off all of the fringes (from the
torn horizontal threads that attached the paper to the notebook) (1 point)
Please do not turn in your R code with your homework.
Be kind to the grader:
- make sure you write your name clearly (so it is easy to read)
- write neatly
The data are from a hypothetical study (I simulated the data) examining the relationship
between dose of a cholesterol lowering drug and reduction in serum cholesterol: x=dose and
y=reduction in cholesterol. The study has 5 drug doses (50, 55, 60, 65, and 70) and there
are 4 subjects for each dose, so the total number of observations is (5)(4)=20. You may get
the data by either of the following two methods.
1. Download the file STA108HW3drug.txt from Canvas. Then submit the following R
commands so you can access the variables x and y directly (the read.table() call assumes
the file has column headers x and y).
STA108HW3=read.table("STA108HW3drug.txt",header=TRUE)
attach(STA108HW3)
2. Cut and paste the following commands into R studio.
x=c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y=c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
It is always a good idea when importing data into R to make sure the data has been correctly
imported. One way to do this is with a quick check of the sample means, which are x̄ = 60
and ȳ = 65.1. This does not guarantee the data has been imported correctly, but it is still
a good check to do.
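For example, after entering the data by method 2, the check is just two commands (a minimal sketch; it repeats the data so the snippet runs on its own):

```r
# Data from method 2 above
x = c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y = c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
mean(x)  # should be 60
mean(y)  # should be 65.1
```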
Note: This is not the same data as I used in the example in class.
The simple linear regression model for this data is

Yi = β0 + β1 xi + εi   for i = 1, ..., n   (1)

where Yi is a random variable for the outcome of the ith experimental unit, β0 and β1 are
parameters and εi is a random variable. We make the following assumptions.
1. E(εi) = 0 for i = 1, ..., n
2. ε1, ..., εn have constant variance, which we define as σ²(εi) = σ² for i = 1, ..., n
3. The ε1, ..., εn are uncorrelated. That is, for any i ≠ j we have σ(εi, εj) = 0.
4. εi ~ Normal for i = 1, ..., n (that is, the εi are all normally distributed)
Let b0 and b1 be the least squares estimates of β0 and β1.
Some R functions to use for this homework.
To use the R function lm() for linear regression you use the following commands.
myresult=lm(y~x)
summary(myresult)
The first command creates the object myresult, which is a special type of object that is
called a list because it contains a whole 'list' of stuff (estimates, residuals, etc.). The tricky
part of using R is knowing how to pull out what you want from myresult. The summary
function pulls out the most important parts of the results. R will print out a bunch of stuff,
and in the middle is a table with the following headings. Here are what the values are (with
the row and column labels you get in the output).
Coefficients:
            Estimate  Std. Error  t value        Pr(>|t|)
(Intercept) b0        s(b0)       t* = b0/s(b0)  p-value for H0 : β0 = 0 vs H1 : β0 ≠ 0
x           b1        s(b1)       t* = b1/s(b1)  p-value for H0 : β1 = 0 vs H1 : β1 ≠ 0
The p-values are two sided. You get a p-value for the intercept that tests H0 : β0 = 0, but this
is usually not a relevant hypothesis, so we usually don't pay any attention to this p-value.
To get other results (such as the residuals or the MSE) out of the list variable you have to
know what R calls it. The following command gives you all of the names of the stuff (values,
etc.) in myresult
names(myresult)
The printed results you get from this command look something like this.
[1] "coefficients"  "residuals"     "effects"       "rank"          "fitted.values" "assign"        "qr"
[8] "df.residual"   "xlevels"       "call"          "terms"         "model"
Now you still might have no idea what half of these things are, but often you find what you
are looking for. Now you can pull out any of these things using commands that look like
this: myresult$thingIwant. For example
myresult$coefficients #THESE ARE THE LEAST SQUARES ESTIMATES
myresult$residuals #THESE ARE THE RESIDUALS
The annoying part is that sometimes the value(s) you want won’t show up in the list of
names or you won't know what it is called. For example, how do we get the MSE? It is hard
to explain, so I will just show the commands
myresult2=summary(lm(y~x))
This creates another list object called myresult2 which contains the results of the summary()
function. When we check to see what the list object called myresult2 contains
names(myresult2)
we get
[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"         "df"
[8] "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled"
Again, you might not know what these things are, but "sigma" here is the square root of
the MSE. So if you want the MSE you can use the command
MSE=myresult2$sigma^2
or alternatively
MSE=(summary(myresult)$sigma)^2
finding p-values for F statistics
For a hypothesis test where the test statistic (TS) has an F distribution, the p-value is

p-value = P(F > TS).   (2)

R has a function that gives the area under the F density function to the left of a given value
(for example the test statistic). So when we use R to calculate a p-value we use

p-value = 1 − P(F ≤ TS),   (3)
which is the same value as you get from equation (2). For a random variable with an F
distribution with numerator degrees of freedom d1 and denominator degrees of freedom d2,
the following command gives the area under the density function to the left of c.
pf(c,d1,d2)
which gives

P(F ≤ c)

where F has an F distribution with numerator degrees of freedom d1 and denominator
degrees of freedom d2.
To use R to get a p-value for a TS that has an F distribution we use equation (3), so the
command is
1 - pf(TS,d1,d2)
where TS is the test statistic.
Note that for the p-value for an F test statistic, we do not multiply by 2 as we do for a t
test statistic.
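As a small illustration (TS = 4.5, d1 = 1, and d2 = 18 are made-up numbers for the sketch, not values from this homework):

```r
# Hypothetical test statistic and degrees of freedom, for illustration only
TS = 4.5
d1 = 1
d2 = 18
pvalue = 1 - pf(TS, d1, d2)  # area to the right of TS under the F(d1, d2) density
pvalue
```

Equivalently, pf(TS, d1, d2, lower.tail=FALSE) gives the right-tail area directly.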
Part A
1. Plot the data. Make sure you label the x-axis and the y-axis. Do not print out the
results until you have added the regression line in question 6.
plot(x,y,xlab="dose",ylab="cholesterol decrease",main="drug effect")
2. Look at the plot and describe what you see. In particular, your description should
answer the following questions.
(a) Does there appear to be a linear relationship between x and y?
(b) Are there any data points that look like they could be outliers? These are data
points that either are extreme (large or small) or appear to deviate in some way
from the overall pattern of the data.
(c) Does the variance of Y conditional on x appear to be constant? That is, does the
spread of the 4 observed Y values for each of the 5 different x values look to be
about the same?
There are no absolute right and wrong answers to these questions. Answering these
questions by just looking at the graph is subjective. It takes lots of experience from
working with many real datasets before one gets good at making these subjective calls.
The important point here is that you always want to look at the data values before
doing any analysis. If there was anything really strange about the data that would
mess up the analysis, you often can see it very easily by looking at the graph.
3. Using the lm() function,
myresult=lm(y~x)
summary(myresult)
give the following values.
(a) b0
(b) s(b0)
(c) b1
(d) s(b1)
(e) the test statistic t* for testing H0 : β1 = 0 versus H1 : β1 ≠ 0
(f) the p-value for testing H0 : β1 = 0 versus H1 : β1 ≠ 0
For this question I just want to be sure you know how to read the values from the
R output. So do NOT just include the output from the R command. I want you to
identify what each number in the output is.
Because I simulated this data, I know the true values of the population intercept and
slope, which are β0 = 30 and β1 = 1.6. Notice that the least squares estimate of β1 is
much closer to the true value than is the least squares estimate of β0. This is because
the sampling variance of b1 is much smaller than the sampling variance of b0. Look at
the estimates of the two sampling standard deviations s(b0) and s(b1). No question
here, just trying to maximize what you learn while doing the homework.
4. For testing H0 : β1 = 0 versus H1 : β1 ≠ 0, does the data provide evidence in favor of
H1? How did you make your decision?
5. Find the MSE.
MSE=(summary(myresult)$sigma)^2
MSE
Since I simulated this data I know that σ² = 256. It is not unusual for the MSE to be
a bit far away from the true value. The sampling distribution of the MSE has a larger
variance than the sampling distribution of the least squares estimates b0 and b1.
6. Add the regression line to the graph, print it, and turn in with your homework.
#CODE ASSUMES YOU HAVE RUN THE plot COMMAND ABOVE
#GET THE LEAST SQUARES ESTIMATES
b0=myresult$coefficients[1]
b1=myresult$coefficients[2]
#FIND THE PREDICTED VALUE FOR Y WHEN X=50
pred50=b0+b1*50
#FIND THE PREDICTED VALUE FOR Y WHEN X=70
pred70=b0+b1*70
lines(x=c(50,70),y=c(pred50,pred70))
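As an aside, R's built-in abline() function accepts a fitted lm object and draws the same line, so the two predicted values are not strictly needed (a sketch of the shortcut; it repeats the data and model fit so the block runs on its own):

```r
# Shortcut: abline() draws the fitted line directly from the lm object
x = c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y = c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
myresult = lm(y ~ x)
plot(x, y, xlab = "dose", ylab = "cholesterol decrease", main = "drug effect")
abline(myresult)  # draws the least squares line b0 + b1*x across the plot
```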
7. Plot the residuals against the dose. Turn in with your homework.
#CODE ASSUMES YOU HAVE RUN THE COMMAND myresult=lm(y~x) ABOVE
plot(x,myresult$residuals,xlab="dose",ylab="residual")
abline(h=0) #THIS ADDS A HORIZONTAL LINE AT Y=0.
8. Examine the residual plot from question 7 and answer the following questions.
(a) Does there appear to be any pattern that suggests the relationship between x and
Y is not linear?
(b) Do there appear to be any outliers? That is, are there any residuals that are
extreme (very large or very small)?
(c) Does it appear that the assumption of constant variance does not hold? That is,
does the variance of the 4 residuals for each of the 5 x values look to be relatively
constant?
Just like with question 2, the answers to these questions do not have an absolute right
and wrong. They are subjective, and it takes lots of experience working with real data
to get a good feel for looking at residual plots. Also, notice questions (a), (b), and (c)
here correspond to the (a), (b), and (c) in question 2. Generally it is easier to answer
these questions looking at the residual plot than it is looking at the graph you made
in question 1.
9. What is the mean of the 20 residual values? That is, what is

ē = (1/n) Σ ei, where ei = yi − b0 − b1 xi.

Answer this question based on the theoretical results (i.e., not based on the actual mean
of the 20 residual values, which you can calculate using R, but because of rounding
errors you will not get the exact same answer as the theoretical result).
mean(myresult$residuals)
10. Does your answer to question 9 tell you anything about whether the regression
assumption E(εi) = 0 holds? A simple 'yes' or 'no' is all you need to write for this
answer.
11. Plot the semistudentized residuals against the dose. These are the residuals
standardized by the estimate of σ, which is √MSE.
plot(x,myresult$residuals/sqrt(MSE),xlab="dose",ylab="semistudentized residual")
abline(h=0) #THIS ADDS A HORIZONTAL LINE AT Y=0.
(You do not need to turn in this plot with your homework.) Compare the residual plot
using the semistudentized residuals to the residual plot you made in question 7. Do
they look the same or different? What is the only difference between the two plots?
12. Looking at the semistudentized residual plot you made in question 11, do there appear
to be any outliers? Explain why it is much easier to make this decision about the
presence of outliers using the semistudentized residuals as compared to the (regular)
residuals.
PART B
Since we have 4 values of the response variable y for each value of x we modify the notation:
yij = observed value of the response variable for the ith subject with dose = xj
Yij = the corresponding random variable
i = subject
j = dose
i = 1, 2, 3, 4 (4 subjects for each dose)
j = 1, 2, 3, 4, 5 (5 different doses: 50, 55, 60, 65, 70)
n = total number of subjects (n = 5 · 4 = 20)
c = number of different x values (c = 5)
There are three different models that may be appropriate for this data. For any model, the
sum of squared errors is

SSE = Σj Σi (yij − ŷij)²

Each model has different predicted values ŷij.
Model A: no dose effects

Yij = β0 + εij   (4)
εij ~ N(0, σ²)

Not counting σ², model A has one parameter (β0), so the degrees of freedom for SSE(A) is
n − 1. The predicted value for model A is ŷij = ȳ, where ȳ is the mean of all yij values.
Model B: linear regression

Yij = β0 + β1 xij + εij   (5)
εij ~ N(0, σ²)

Not counting σ², model B has 2 parameters (β0 and β1), so the degrees of freedom for SSE(B)
is n − 2. The predicted value for model B is ŷij = b0 + b1 xij.
Model C: different mean for each dose

Yij = μj + εij   (6)
εij ~ N(0, σ²)

Not counting σ², model C has c = 5 parameters (μ1, μ2, μ3, μ4, μ5), so the degrees of freedom
for SSE(C) is n − c = n − 5. The predicted value for model C is ŷij = ȳj, where ȳj is the
mean of the four yij values for dose j.
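For example, the five dose means ȳ1, ..., ȳ5 (the predicted values for model C) can be computed with tapply() (a sketch; it repeats the Part A data so it runs on its own):

```r
# Data from Part A
x = c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y = c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
ybarj = tapply(y, x, mean)  # mean of the 4 y values at each of the 5 doses
ybarj
```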
model selection F test
For a general explanation of the model selection F test see section 2.8 in the book.
For the lack of fit test (which compares model B with model C) see section 3.7 in the book.
Warning: they use different notation for SSE.
For this homework we have the following notation for SSE:
SSE(A) = sum of squared errors for model A
SSE(B) = sum of squared errors for model B
SSE(C) = sum of squared errors for model C
SSE(F) = sum of squared errors for the full model
SSE(R) = sum of squared errors for the reduced model
Which is the full and which is the reduced model depends on which models you are comparing.
The test statistic for model selection is

F* = [ (SSE(R) − SSE(F)) / (dfR − dfF) ] / [ SSE(F) / dfF ]

where SSE(R) is the sum of squared errors for the reduced model, SSE(F) is the sum of
squared errors for the full model, dfR is the degrees of freedom for the reduced model, and
dfF is the degrees of freedom for the full model. Which model is called reduced and which
is called full depends on which comparison you are making.
For any model, the parameter space is the set of possible values for the parameters in the
model. For example, for simple regression the parameter space for both β0 and β1 is the
set of all real numbers, and the parameter space for σ²(εi) is the set of all non-negative
numbers.
For the model selection F test, the reduced model is the one that you can get from the
full model by constraining the parameter space. In many cases the constraint is setting a
regression parameter to 0. For example, if you are comparing models A and B, then model A
is the reduced model and model B is the full model, because if we use the constraint β1 = 0
on model B we get model A.
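Putting the test statistic into R for the A-versus-B comparison looks like this (a sketch; it uses the SSE commands described in the next section, with dfR = n − 1 and dfF = n − 2):

```r
# Data from Part A
x = c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y = c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
n = length(y)
SSER = sum((y - mean(y))^2)        # SSE(A), the reduced model
SSEF = sum(lm(y ~ x)$residuals^2)  # SSE(B), the full model
dfR = n - 1
dfF = n - 2
Fstar = ((SSER - SSEF) / (dfR - dfF)) / (SSEF / dfF)
pvalue = 1 - pf(Fstar, dfR - dfF, dfF)
```

As the hint in Part B question 1 says, Fstar computed this way should match (t*)² from the lm() output.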
R code to get the SSE for each of the three models.

find SSE(A)

There are multiple ways to use R to find SSE(A). The predicted value for yij using model A
is ȳ. So you can find the sum of squared errors using the formula

SSE(A) = Σj Σi (yij − ȳ)²

where ȳ is the sample mean of the 20 y values (y11, y12, ..., y45). R code is
SSEA=sum((y-mean(y))^2)
Another method is to use the regression model given by equation (4), which is the same as
a regression model with β1 = 0. In R you can run a model that only contains an intercept
(effectively setting β1 = 0) and then get SSE(A) from the residuals.
modelA=lm(y~1) #~1 MEANS USE A MODEL WITH ONLY THE INTERCEPT
SSEA=sum(modelA$residuals^2)
SSEA
You may use either method to get SSE(A).
find SSE(B)
To get the SSE(B) for model B we use the regular regression model, since SSE(B)=SSE,
where this is the SSE from the simple linear regression model.
modelB=lm(y~x)
SSEB=sum(modelB$residuals^2)
find SSE(C)
There are multiple methods to find SSE(C). One method is to run a regression where x is
modeled as a categorical variable (with 5 categories for the 5 doses). We will see later in the
course that regression can be used when the x variable is categorical.
modelC=lm(y~factor(x)-1) #factor(x) CREATES A CATEGORICAL VARIABLE
#THE factor(x)-1 GIVES THE SAME MODEL PARAMETERIZATION AS GIVEN ABOVE FOR MODEL C
#TO SEE THE ESTIMATES OF THE 5 PARAMETERS FROM THIS MODEL
summary(modelC) #YOU DO NOT NEED TO RUN THIS TO GET SSEC
SSEC=sum(modelC$residuals^2)
SSEC
You could alternatively use the formula for SSE(C), which is

SSE(C) = Σj Σi (yij − ȳj)²

where ȳ1, ..., ȳ5 are the sample means corresponding to each value of x. The advantage of
the first method is that the R code is much shorter than you would need to actually use this
formula.
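If you do want to use the formula, base R's ave() function (which expands each group mean back to the full length of y) keeps it to one line (a sketch; it repeats the Part A data so it runs on its own):

```r
# Data from Part A
x = c(50,50,50,50,55,55,55,55,60,60,60,60,65,65,65,65,70,70,70,70)
y = c(63,23,30,38,66,86,40,63,63,32,77,103,88,70,53,59,71,75,99,103)
SSEC2 = sum((y - ave(y, x))^2)  # ave(y, x) replaces each y value with its dose mean
SSEC2
```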
This is NOT the same data as the example I did in class, so don’t expect to get the same
numbers.
1. Conduct a hypothesis test of H0 : β1 = 0 versus H1 : β1 ≠ 0 using the model selection
F test approach. Use significance level α = 0.05.
Hint: The reduced model is model A (no dose effect) and the full model is model B
(linear regression).
(a) Find the value of the test statistic F*.
Hint: You can verify your value of F* is correct by checking that it is equal to
(t*)² from part A question 3(e). I suggest you actually do the calculation to find
F* and don't cheat by using (t*)².
(b) What is the distribution of the test statistic F if H0 is true? Be sure to include
both the numerator and denominator degrees of freedom in your answer.
(c) Find the p-value.
(d) Does the data provide evidence for H1 : β1 ≠ 0? How did you make your decision?
2. Conduct a lack of fit test for the regression model. Use significance level α = 0.05.
Hint: The reduced model is model B (linear regression) and the full model is model C
(different means for each dose), so the null and alternative hypotheses are
H0 : E(Yij | xj) = β0 + β1 xij
H1 : E(Yij | xj) = μj
(a) Find the value of the test statistic F*.
(b) What is the distribution of the test statistic F if H0 is true? Be sure to include
both the numerator and denominator degrees of freedom in your answer.
(c) Find the p-value.
(d) Does the data provide evidence for 'lack of fit', that is, evidence in favor of H1?
How did you make your decision?
