THE AUSTRALIAN NATIONAL UNIVERSITY
RESEARCH SCHOOL OF FINANCE,
ACTUARIAL STUDIES AND STATISTICS
STAT3008/STAT7001
APPLIED STATISTICS
Assignment 2
Lecturer: Dr Tao Zou
Last Updated: Sun Oct 7 21:54:48 2018
This assignment is due at 11:00 am, Oct 17, 2018.
This assignment is worth 15% of your final grade but is optional and redeemable.
Students are expected to complete this assignment individually. Maximum points:
15.0. No partial credit is awarded for any question, since each question is
worth only 0.5 points. Assignments can only be submitted via the physical
assignment box at the front of the reception on Level 4, CBE Building
(26C). Hard copy submission is required. Late submissions will not be accepted,
and the weight will roll over to your final exam. Identical submissions are treated as
cheating.
Please follow the instructions for each question exactly and write your short
answers in the answer sheet file available on Wattle. Note
that you do not need to copy the questions into the answer sheet. Please submit only
your finished answer sheet and do not paste any unrelated results. The data used in
this assignment can be found on Wattle.
The significance level for all the questions is set to be 0.05.
Question 1 (Variable Selection and Multicollinearity, 5.0 points)
Consider the data used in Questions 1 - 3 of Assignment 1. Please split the course
evaluation dataset with 463 individuals into two parts: the training dataset and
the test dataset. The training dataset includes the first 400 observations and the test
dataset includes the last 63 observations. Based on these two datasets,
please use R to answer the following questions in the answer sheet.
Part 1. (2.5 points) In this part, please only use the training dataset to obtain the
variable selection and fitting results.
a) (0.5 points) If we only consider the response variable (logarithm of “Course_eval”)
and explanatory variables used in Question 1 of Assignment 1, please paste the
Cp plot among all subsets in the answer sheet (showing at most 2 subsets for
each size).
b) (0.5 points) Based on the above Cp statistics, which variables should we choose
to predict the logarithm of “Course_eval” by using the variable selection among
all subsets?
c) (0.5 points) If we still consider the response variable (logarithm of “Course_eval”)
and explanatory variables used in Question 1 of Assignment 1, please use R to
obtain the variance inflation factor (VIF) for each of the explanatory variables,
and paste them in the answer sheet. Based on the “rule of thumb” cut-off for
VIF, does the multicollinearity problem exist if we regress the response on these
explanatory variables?
d) (0.5 points) If we still consider the response variable (logarithm of “Course_eval”)
and explanatory variables used in Question 1 of Assignment 1, please use R to
perform the backward elimination based on BIC. Which variables should we
choose to predict the response variable (logarithm of “Course_eval”) by using
this variable selection method?
e) (0.5 points) Please paste the R code for all the above analyses of Part 1 in the
answer sheet.
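As a minimal sketch of the techniques named in c) and d): the variance inflation factor of a predictor is 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on the remaining ones, and backward elimination under BIC can be run with base R's step() by setting k = log(n). The data and the names y, x1, x2, x3 and vif_by_hand are hypothetical stand-ins, not the course evaluation data. The cut-off of 10 mentioned in the comment is one common rule of thumb and may differ from the one used in lectures.

```r
# Synthetic stand-in data; all variable names here are hypothetical.
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # deliberately almost collinear with x1
x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
dat <- data.frame(y, x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors.
vif_by_hand <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}
vifs <- vif_by_hand(dat, c("x1", "x2", "x3"))
print(round(vifs, 2))   # ASSUMPTION: values above 10 are often read as a warning sign

# Backward elimination: step() penalises with k = 2 (AIC) by default;
# k = log(n) switches the criterion to BIC.
full    <- lm(y ~ x1 + x2 + x3, data = dat)
bic_fit <- step(full, direction = "backward", k = log(nrow(dat)), trace = 0)
print(formula(bic_fit))
```

Computing the VIF from auxiliary regressions keeps the sketch free of add-on packages; a packaged VIF function, if the course provides one, should give the same numbers for a model with only numeric predictors.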
Part 2. (2.5 points) In this part, please use the variable selection and fitting results in
Part 1 and then use the test dataset to evaluate the forecast performances of different
fitted models.
a) (0.5 points) If we still consider the response variable (logarithm of “Course_eval”)
and all the explanatory variables used in Question 1 of Assignment 1, please fit
the training data to obtain a fitted model. What is the mean squared prediction
error (MSPE) of the logarithm of “Course_eval” for this fitted model based on
the test dataset? Please round your result to four decimal places.
b) (0.5 points) Please use the variables selected in b) of Part 1 and fit the training
data to obtain a fitted model. What is the mean squared prediction error (MSPE)
of the logarithm of “Course_eval” for this fitted model based on the test dataset?
Please round your result to four decimal places.
c) (0.5 points) Please use the variables selected in d) of Part 1 and fit the training
data to obtain a fitted model. What is the mean squared prediction error (MSPE)
of the logarithm of “Course_eval” for this fitted model based on the test dataset?
Please round your result to four decimal places.
d) (0.5 points) Based on the above results, which model among a) - c) in this part
is the best?
e) (0.5 points) Please paste the R code for all the above analyses of Part 2 in the
answer sheet.
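The train/test evaluation in this part can be sketched as follows, assuming the usual definition of the mean squared prediction error, MSPE = (1/n_test) * sum over test observations of (Y - Yhat)^2. The data frame, the column names log_y, x1, x2 and the helper mspe are hypothetical; only the split sizes (first 400 rows, last 63 rows) are taken from the text.

```r
# Synthetic stand-in with the same number of rows as the course evaluation data.
set.seed(2)
n   <- 463
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$log_y <- 1 + 0.3 * dat$x1 + rnorm(n, sd = 0.2)

# Split by row position: first 400 rows for training, last 63 rows for testing.
train <- dat[1:400, ]
test  <- dat[401:n, ]

# Mean squared prediction error of a fitted model on new data.
mspe <- function(fit, newdata, response) {
  pred <- predict(fit, newdata = newdata)
  mean((newdata[[response]] - pred)^2)
}

# Fit candidate models on the training rows only, then score them on the test rows.
fit_full  <- lm(log_y ~ x1 + x2, data = train)
fit_small <- lm(log_y ~ x1, data = train)

mspe_full  <- mspe(fit_full,  test, "log_y")
mspe_small <- mspe(fit_small, test, "log_y")
print(round(c(full = mspe_full, small = mspe_small), 4))
```

Under this criterion the model with the smaller MSPE on the test rows predicts better out of sample.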
Wheat Kernels (revised based on the final exam in 2017). The presence of diseased
kernels in wheat can reduce the value of a wheat producer’s entire crop. It is
important to identify these kernels after being harvested but prior to sale. To facilitate
this identification process, automated/artificial intelligence (AI) systems have been
developed to separate healthy kernels from the rest. Improving these systems requires
better understanding of the measurable ways in which healthy kernels differ from
kernels that are infected or partly infected with a fungus. To this end, Martin et
al. (1998) conducted a study examining numerous physical properties of kernels -
density, hardness, size, weight, and moisture - measured on a sample of wheat kernels
from two different classes of wheat, hard red winter (hrw) and soft red winter (srw)
(represented by the categorical variable “class”) in the “wheat.csv” dataset. Each
kernel’s condition was also classified as “Healthy” or “Not Healthy” by human visual
inspection (represented by the categorical variable “type1”). Moreover, human visual
inspection can further classify the “Not Healthy” category into “Partly Diseased” and
“Diseased” (represented by the categorical variable “type2”).
Please split the dataset “wheat.csv” with 275 observations into two parts: the training
dataset and the test dataset. The training dataset includes the first 200 observations
and the test dataset includes the last 75 observations. Based on these
two datasets, please use R to answer the following Questions 2 - 3 in the answer sheet.
Question 2 (Binary Logistic Regression, 4.0 points)
In this question, we are only interested in predicting the two-category response “type1”
in the dataset “wheat.csv”. Please use R to answer the following questions in the
answer sheet.
Part 1. (2.5 points) In this part, please only use the training dataset to obtain the
variable selection and fitting results.
a) (0.5 points) After reading the dataset “wheat.csv” into a data frame in R, the
column “class” in the dataset is a vector of factor values. Please use R to regress
“type1” on “class”, “density”, “hardness”, “size”, “weight”, and “moisture”,
together with all the interactions between “class” and the other continuous
explanatory variables, in order to answer the question whether the probability
of “Healthy” of a wheat kernel is associated with other variables. Please do
not use the indicator variable of “class”, but instead, use the factor values of
“class” directly in the fitting of the regression with the category “srw” as the
baseline level. Based on the “summary” function output of this fitted model, can
we use the “Null deviance” and “Residual deviance” in the output to construct
a drop-in-deviance $\chi^2$-test? If we can, what are the null hypothesis and the
alternative hypothesis of this test?
b) (0.5 points) If we can construct a drop-in-deviance $\chi^2$-test in the above question,
please use R to carry out this $\chi^2$-test. What is the value of the test statistic?
What conclusion can you draw from this $\chi^2$-test? If we cannot construct a
drop-in-deviance $\chi^2$-test in the above question, please state your reasons.
c) (0.5 points) Consider all the variables that you input in R in a) of this part
and then use R to perform the forward selection based on AIC. Which variables
should we choose to predict the probability of “Healthy” by using this variable
selection method?
d) (0.5 points) Consider all the variables that you input in R in a) of this part
and then use R to perform the forward selection based on BIC. Which variables
should we choose to predict the probability of “Healthy” by using this variable
selection method?
e) (0.5 points) Please paste the R code for all the above analyses of Part 1 in the
answer sheet.
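The mechanics behind this part can be sketched on synthetic data: choosing the baseline level of a factor, fitting a logistic regression with factor-by-covariate interactions, forming the difference between the null and residual deviances, and running forward selection with step(). All data and names (grp, x1, x2, y) are hypothetical and do not come from “wheat.csv”. Whether the deviance difference is a valid test in the assignment's setting is left to the reader.

```r
# Synthetic stand-in data; names and coefficients are hypothetical.
set.seed(3)
n <- 200
dat <- data.frame(
  grp = factor(sample(c("a", "b"), n, replace = TRUE)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
dat$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * dat$x1))

# The first factor level acts as the baseline; relevel() picks a different one.
dat$grp <- relevel(dat$grp, ref = "b")

# grp * (x1 + x2) expands to main effects plus the grp:x1 and grp:x2 interactions.
full <- glm(y ~ grp * (x1 + x2), family = binomial, data = dat)

# Drop in deviance between the intercept-only model and the full model,
# referred to a chi-square distribution with df = number of extra coefficients.
stat <- full$null.deviance - full$deviance
df   <- full$df.null - full$df.residual
pval <- pchisq(stat, df = df, lower.tail = FALSE)
print(c(statistic = stat, df = df, p.value = pval))

# Forward selection starts from the intercept-only model and adds terms up to `upper`.
null_fit <- glm(y ~ 1, family = binomial, data = dat)
scope    <- list(lower = ~ 1, upper = formula(full))
fwd_aic  <- step(null_fit, scope = scope, direction = "forward", trace = 0)
fwd_bic  <- step(null_fit, scope = scope, direction = "forward", k = log(n), trace = 0)
print(formula(fwd_aic))
print(formula(fwd_bic))
```

The only difference between the two selection runs is the penalty per coefficient: 2 for AIC and log(n) for BIC, so the BIC run tends to stop at a model that is no larger than the AIC one.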
Part 2. (1.5 points) In this part, please use the variable selection and fitting results in
Part 1 and then use the test dataset to evaluate the forecast performances of different
fitted models. Let $Y_\ell$ represent the response “type1” for observations $\ell = 1, \dots, n_{\text{test}}$
in the test dataset. Suppose $\hat{Y}_\ell$ to be the corresponding prediction of response based
on the fitted models from the training dataset. Define an indicator variable
$$I\{Y_\ell = \hat{Y}_\ell\} = \begin{cases} 1, & \text{if } Y_\ell = \hat{Y}_\ell; \\ 0, & \text{otherwise.} \end{cases}$$
Then the percentage of correct forecast (PCF) can be defined by
$$\text{PCF} = \frac{1}{n_{\text{test}}} \sum_{\ell=1}^{n_{\text{test}}} I\{Y_\ell = \hat{Y}_\ell\}.$$
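A minimal sketch of the PCF defined above, for a binary logistic fit on synthetic data. The predictor x and the simulated labels are hypothetical. Turning an estimated probability into a predicted category requires a cut-off, which the text does not state; the sketch assumes 0.5.

```r
# Synthetic stand-in data with the same row count and split sizes as in the text.
set.seed(4)
n   <- 275
dat <- data.frame(x = rnorm(n))
is_healthy <- rbinom(n, 1, plogis(2 * dat$x)) == 1
dat$y <- factor(ifelse(is_healthy, "Healthy", "Not Healthy"),
                levels = c("Not Healthy", "Healthy"))

train <- dat[1:200, ]
test  <- dat[201:n, ]

# With a factor response, binomial glm() treats the first level as "failure",
# so the fitted probabilities below are for "Healthy".
fit   <- glm(y ~ x, family = binomial, data = train)
p_hat <- predict(fit, newdata = test, type = "response")

# ASSUMPTION: predict "Healthy" when the estimated probability exceeds 0.5.
y_hat <- ifelse(p_hat > 0.5, "Healthy", "Not Healthy")

# PCF is the share of test observations whose prediction matches the observed label.
pcf <- mean(y_hat == as.character(test$y))
print(round(pcf, 4))
```

Because the indicator is 1 for a match and 0 otherwise, the sum divided by the test size reduces to the mean of a logical vector.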
a) (0.5 points) Please use the variables selected in c) of Part 1 and fit the training
data to obtain a fitted model. What is the PCF for this fitted model based on
the test dataset? Please round your result to four decimal places.
b) (0.5 points) Please use the variables selected in d) of Part 1 and fit the training
data to obtain a fitted model. What is the PCF for this fitted model based on
the test dataset? Please round your result to four decimal places. Based on these
two PCFs, which model between a) - b) in this part is better?
c) (0.5 points) Please paste the R code for all the above analyses of Part 2 in the
answer sheet.
Question 3 (Multicategory Response Regression, 4.5 points)
In this question, we are only interested in predicting the three-category response
“type2” in the dataset “wheat.csv”. Please use R to answer the following questions in
the answer sheet.
Part 1. (3.0 points) In this part, please only use the training dataset to obtain the
fitting results.
a) (0.5 points) Obviously, the three-category response “type2” is ordinal with the
order “Diseased” < “Partly Diseased” < “Healthy”. Please use the ordinal
response regression model to regress the ordinal response “type2” on “class”,
“density”, “hardness”, “size”, “weight”, and “moisture”. Please do not consider
any interaction terms. Also in the regression, please do not use the indicator
variable of “class”, but instead, use the factor values of “class” directly in the
fitting of the regression with the category “srw” as the baseline level. How many
unknown regression coefficients are there in this ordinal response regression model?
b) (0.5 points) What is the 95% confidence interval for the coefficient of “density”
(rounded to four decimal places) based on the above fitted ordinal response
regression model?
c) (0.5 points) If we are interested in testing whether or not “density” is needed
based on the above fitted ordinal response regression model, please construct
an appropriate test. What is the p-value for your test (rounded to four decimal
places)? What conclusion can you obtain based on the p-value?
d) (0.5 points) Suppose we now ignore the order of “Diseased” < “Partly Diseased”
< “Healthy” and treat “type2” as a nominal response. Please use the nominal
response regression model to regress the nominal response “type2” on those
explanatory variables that you input in R in a) of this part. How many unknown
regression coefficients are there in this nominal response regression model?
e) (0.5 points) If we are interested in testing whether or not “density” is needed
based on the above fitted nominal response regression model, please construct
an appropriate test. What is the p-value for your test (rounded to four decimal
places)? What conclusion can you obtain based on the p-value?
f) (0.5 points) Please paste the R code for all the above analyses of Part 1 in the
answer sheet.
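One possible way to fit the two model types in this part is sketched below on synthetic data, assuming the proportional-odds model from MASS::polr() for the ordinal case and nnet::multinom() for the nominal case; the course may use other functions. The three-level response, the predictors x1 and x2, and the helper lrt are hypothetical. The interval shown is a Wald interval, which is also an assumption, since the text does not say which interval type is expected.

```r
library(MASS)   # polr(): proportional-odds ordinal regression
library(nnet)   # multinom(): multinomial (nominal) logistic regression

# Synthetic three-category response driven by x1 only; x2 is pure noise.
set.seed(5)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 1.2 * x1 + rlogis(n)
y_ord  <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
              labels = c("low", "mid", "high"), ordered_result = TRUE)
dat <- data.frame(y_ord, y_nom = factor(y_ord, ordered = FALSE), x1, x2)

# Full and reduced (x1 dropped) fits for each model type.
ord_full <- polr(y_ord ~ x1 + x2, data = dat, Hess = TRUE)
ord_red  <- polr(y_ord ~ x2, data = dat, Hess = TRUE)
nom_full <- multinom(y_nom ~ x1 + x2, data = dat, trace = FALSE)
nom_red  <- multinom(y_nom ~ x2, data = dat, trace = FALSE)

# Likelihood ratio (drop-in-deviance) test of the full against the reduced model.
lrt <- function(full, reduced) {
  stat <- 2 * (as.numeric(logLik(full)) - as.numeric(logLik(reduced)))
  df   <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
  c(stat = stat, df = df, p = pchisq(stat, df = df, lower.tail = FALSE))
}
lrt_ord <- lrt(ord_full, ord_red)
lrt_nom <- lrt(nom_full, nom_red)
print(round(rbind(ordinal = lrt_ord, nominal = lrt_nom), 4))

# ASSUMPTION: Wald 95% interval, estimate +/- z * standard error.
ct <- coef(summary(ord_full))
ci_x1 <- ct["x1", "Value"] + c(-1, 1) * qnorm(0.975) * ct["x1", "Std. Error"]
print(round(ci_x1, 4))
```

In the nominal model a predictor has one coefficient per non-baseline category, so dropping it removes more parameters than in the ordinal model; the degrees of freedom printed by lrt reflect that difference.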
Part 2. (1.5 points) In this part, please use the fitting results in Part 1 and then use
the test dataset to evaluate the forecast performances of different fitted models. Let
$Y_\ell$ represent the response “type2” for observations $\ell = 1, \dots, n_{\text{test}}$ in the test dataset.
Suppose $\hat{Y}_\ell$ to be the corresponding prediction of response based on the fitted models
from the training dataset. Define an indicator variable
$$I\{Y_\ell = \hat{Y}_\ell\} = \begin{cases} 1, & \text{if } Y_\ell = \hat{Y}_\ell; \\ 0, & \text{otherwise.} \end{cases}$$
Then the percentage of correct forecast (PCF) can be defined by
$$\text{PCF} = \frac{1}{n_{\text{test}}} \sum_{\ell=1}^{n_{\text{test}}} I\{Y_\ell = \hat{Y}_\ell\}.$$
a) (0.5 points) Please use the fitting of the training data in a) of Part 1 to obtain a
fitted ordinal response regression model. What is the PCF for this fitted model
based on the test dataset? Please round your result to four decimal places.
b) (0.5 points) Please use the fitting of the training data in d) of Part 1 to obtain a
fitted nominal response regression model. What is the PCF for this fitted model
based on the test dataset? Please round your result to four decimal places. Based
on these two PCFs, which model between a) - b) in this part is better?
c) (0.5 points) Please paste the R code for all the above analyses of Part 2 in the
answer sheet.
Question 4 (Simulation for Binary Logistic Regression, 1.5 points)
Consider the binary logistic regression model $\text{logit}(\mu\{Y|X\}) = \beta_0 + \beta_1 X$ for the
observations $\{(Y_i, X_i) : i = 1, \dots, n\}$, where the logit link function is
$\text{logit}(u) = \log\{u/(1-u)\}$. The maximum likelihood estimates (MLEs)
$\hat{\beta}_0$ and $\hat{\beta}_1$ of the regression coefficients $\beta_0$ and $\beta_1$ can then be obtained.
Lily wants to use R to generate random samples based on the binary logistic regression
model assumptions, in order to understand the “roughly” unbiased property for MLE,
as well as “approximate” normality for the sampling distribution of MLE. She follows
the steps below.
Step 1: Specify $\beta_0 = 1$ and $\beta_1 = 2$.
Step 2: Suppose the observations $X_1, \dots, X_n$ are $0.01, 0.02, 0.03, \dots, 1.00$, so the
number of observations is $n = 100$.
Step 3: Compute
$$\pi_i = \frac{e^{\beta_0 + \beta_1 X_i}}{1 + e^{\beta_0 + \beta_1 X_i}} \quad \text{for } i = 1, \dots, n.$$
Step 4: Generate $Y_i$ independently from the Bernoulli distribution with
$P(Y_i = 1 | X_i) = \pi_i$ and $P(Y_i = 0 | X_i) = 1 - \pi_i$ for $i = 1, \dots, n$. (Hint: the R function
“rbinom(1,1,pii)”, with pii standing for $\pi_i$, returns one random number of $Y_i$ from the Bernoulli distribution
with $P(Y_i = 1 | X_i) = \pi_i$ and $P(Y_i = 0 | X_i) = 1 - \pi_i$.)
Step 5: Repeat Step 4 1,000 times and obtain 1,000 different datasets of
$\{(Y_i, X_i) : i = 1, \dots, n\}$.
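Steps 1 to 4 can be sketched for a single dataset as follows. The seed and the variable names are illustrative choices, not part of the steps, and the final glm() call only shows how one pair of estimates could be obtained from one dataset; it does not carry out the 1,000 repetitions of Step 5.

```r
set.seed(6)                  # ASSUMPTION: seed added only so the sketch is reproducible

beta0 <- 1                   # Step 1: true coefficients
beta1 <- 2

x <- (1:100) / 100           # Step 2: 0.01, 0.02, ..., 1.00
n <- length(x)

# Step 3: success probabilities; equivalent to plogis(beta0 + beta1 * x).
pii <- exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))

# Step 4: one Bernoulli draw per observation. rbinom() is vectorised over
# the probability argument, so this matches calling rbinom(1, 1, pii[i]) for each i.
y <- rbinom(n, 1, pii)

# Maximum likelihood estimates for this single dataset.
fit <- glm(y ~ x, family = binomial)
print(coef(fit))
```

Only the response is random here; the design points x stay fixed, so each repetition of Step 4 would change y and leave x untouched.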
Lei Li is a friend of Lily. Lily hands over the above 1,000 datasets to him but she does
not tell him the true values of $\beta_0$ and $\beta_1$. Based on each dataset, Lei Li computes the
MLEs $\hat{\beta}_0$ and $\hat{\beta}_1$. Ultimately, he obtains 1,000 different MLEs.
Then Lily tells Lei Li the true value of $\beta_1$, and both Lily and Lei Li compare this true
value to the sample average of these 1,000 different MLEs $\hat{\beta}_1$, as well as to the histogram
of these 1,000 estimates $\hat{\beta}_1$.
Please answer the following questions in the answer sheet.
a) (0.5 points) Suppose you play the roles of both Lily and Lei Li and implement the above
steps in R. Please paste the complete R code for all the above procedures in the
answer sheet.
b) (0.5 points) What is the sample average of the 1,000 estimates $\hat{\beta}_1$ (rounded to four
decimal places)? Is it close to the true value of $\beta_1$? Please answer this question
in the answer sheet.
c) (0.5 points) Please paste the histogram of the 1,000 estimates $\hat{\beta}_1$ in the answer
sheet. Is it close to the normal distribution?