APANPS4335: Machine Learning
Due February 24 by 11:55pm
Directions: please submit your homework as two files — .Rmd and .pdf — on the Canvas class website.
Include short explanations of your code and results throughout.
1 The Challenger Disaster (20 points)
The NASA space shuttle Challenger exploded on January 28, 1986, just 73 seconds after liftoff, in one of the
worst disasters in U.S. space exploration history. The explosion came after 23 successful space shuttle flights.
On the morning of January 28, however, the weather was unusually cold and engineers warned that certain
components — particularly the rubber O-rings that sealed the joints of the shuttle’s solid rocket boosters —
were vulnerable to failure at low temperatures. These warnings went unheeded.
The dataset challenger.csv on the Canvas site includes the predictor temptr, the temperature in degrees Fahrenheit at the time of each of the 23 flights prior to the Challenger disaster, and the response F.distress, which is 1 if at least one primary O-ring suffered thermal distress and 0 otherwise. Load this dataset.
1.1 Logistic Regression
Use logistic regression to model the effect of temperature (temptr) on the probability of thermal distress to
the O-rings (F.distress). That is, fit a logistic model with F.distress as the response and temptr as the predictor.
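A minimal sketch of the fit, using a small made-up data frame in place of challenger.csv (the column names temptr and F.distress follow the assignment; the numbers below are illustrative, not the real flight records):

```r
# Illustrative stand-in for challenger.csv -- NOT the real flight data.
challenger <- data.frame(
  temptr     = c(53, 57, 63, 66, 67, 70, 70, 72, 75, 76, 79, 81),
  F.distress = c( 1,  1,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0)
)

# Logistic regression: F.distress as the response, temptr as the predictor.
fit <- glm(F.distress ~ temptr, data = challenger, family = binomial)
summary(fit)  # check the sign and p-value of the temptr coefficient
```

With the real data you would replace the data frame with read.csv("challenger.csv").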
1.2 Estimate β1
Estimate β1, the effect of temperature on the probability of thermal distress, and interpret your result. Note
that in logistic regression, we interpret coefficients as odds or log odds, not as averages as with linear regression.
See page 133 in ISL for more on interpreting coefficients.
1.3 Confidence intervals
Construct a 95% confidence interval to describe the effect of temperature on the odds or log odds of thermal
distress (i.e., construct a 95% interval for β1). Interpret your confidence interval.
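One way to sketch the interval in R (again on the illustrative stand-in data, not the real flight records):

```r
# Illustrative stand-in for challenger.csv -- NOT the real flight data.
challenger <- data.frame(
  temptr     = c(53, 57, 63, 66, 67, 70, 70, 72, 75, 76, 79, 81),
  F.distress = c( 1,  1,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0)
)
fit <- glm(F.distress ~ temptr, data = challenger, family = binomial)

ci <- confint.default(fit, level = 0.95)  # Wald 95% CIs on the log-odds scale
ci["temptr", ]       # interval for beta1 (change in log odds per degree F)
exp(ci["temptr", ])  # the same interval on the odds scale
```

Plain confint() would instead give profile-likelihood intervals; either is a reasonable answer if you say which you used.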
1.4 Probability of distress at 31 degrees
Predict the probability of thermal distress at 31 degrees Fahrenheit, which was the temperature at the time
of the Challenger flight.
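The prediction itself is one call to predict() with type = "response"; sketched here on the illustrative stand-in data:

```r
# Illustrative stand-in for challenger.csv -- NOT the real flight data.
challenger <- data.frame(
  temptr     = c(53, 57, 63, 66, 67, 70, 70, 72, 75, 76, 79, 81),
  F.distress = c( 1,  1,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0)
)
fit <- glm(F.distress ~ temptr, data = challenger, family = binomial)

# type = "response" returns a probability rather than a log odds.
p31 <- predict(fit, newdata = data.frame(temptr = 31), type = "response")
p31
```

Note that 31°F is well below every observed temperature, so this is an extrapolation; worth mentioning in your interpretation.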
1.5 Fifty percent probability of distress
At what temperature does the predicted probability of O-ring distress equal 0.5? Hint: You’ll recall that
with logistic regression, we use the logistic function:
p(X) = e^(β0 + β1x) / (1 + e^(β0 + β1x))

Now consider the value of p(X) when:

p(X) = e^0 / (1 + e^0)
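Completing the hint: setting p(X) = 0.5 and solving for x gives a closed form. In the notation above,

```latex
p(X) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} = \frac{1}{2}
\;\Longleftrightarrow\; e^{\beta_0 + \beta_1 x} = 1
\;\Longleftrightarrow\; \beta_0 + \beta_1 x = 0
\;\Longleftrightarrow\; x = -\beta_0 / \beta_1
```

so the answer is obtained by plugging your estimated coefficients into −β0/β1.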
1.6 Confusion matrix
Compute a confusion matrix and overall number of correct predictions. What does the confusion matrix
reveal about the types of errors present in this case using logistic regression? See page 145 in ISL for more on
confusion matrices.
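A sketch of the confusion matrix with a 0.5 cutoff, again on the illustrative stand-in data:

```r
# Illustrative stand-in for challenger.csv -- NOT the real flight data.
challenger <- data.frame(
  temptr     = c(53, 57, 63, 66, 67, 70, 70, 72, 75, 76, 79, 81),
  F.distress = c( 1,  1,  1,  0,  0,  1,  0,  0,  0,  0,  0,  0)
)
fit <- glm(F.distress ~ temptr, data = challenger, family = binomial)

# Classify each flight with a 0.5 cutoff and cross-tabulate against the truth.
pred <- as.integer(predict(fit, type = "response") > 0.5)
cm <- table(predicted = pred, actual = challenger$F.distress)
cm

correct <- sum(pred == challenger$F.distress)  # overall number correct
correct
```

The off-diagonal cells of cm separate the false positives from the false negatives, which is what the question asks you to discuss.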
2 Collinearity (20 points)
When modeling, we want predictor variables that are correlated with the response variable. We do not
want predictors correlated with each other. Inter-predictor correlation may result in collinearity (also called
multicollinearity) and can destabilize a model. If two predictor variables are nearly identical, our β coefficients
may change arbitrarily (see Lecture 4, slide 17) and the standard error (SE) may increase, hiding significant
relationships between the response and the predictors.
We can test for collinearity by computing a Variance Inflation Factor (VIF), which measures the association
(correlation) among the predictor variables, excluding the response variable. It is given by:

VIF_j = 1 / (1 − R²_j)

R²_j measures how much one predictor is correlated with the other predictors. When a predictor is completely
uncorrelated with (independent of) the others, VIF = 1/(1 − 0) = 1. As a predictor approaches complete
collinearity (extreme multicollinearity) with the other predictors, R²_j → 1 and VIF = 1/(1 − R²_j) → ∞.
Some people believe inter-predictor correlation may be too high when VIF becomes greater than 2.5, while
others flag the correlation as problematic when VIF becomes greater than 10. Here, we will use VIF > 10.
2.1 Variance Inflation Factor
Below, we will calculate VIF in R using predictors in the data cig_sales.txt. This is the same data as we
used in homework assignment 2. Start with the predictor variable Female.
2.1.1 Coefficient of determination
Compute the coefficient of determination R²_j for Female by regressing Female (using R's lm() function) on
the other predictors (e.g., Age, HS, ... Price) in cig_sales.txt, excluding the response variable Sales.
2.1.2 VIF for Female and other predictors
Once you determine R²_j for Female, compute Female's VIF using the above formula. Separately, compute
VIF for the other five predictor variables. (Again, don't use the response variable Sales.)
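The whole computation needs only lm(). Since cig_sales.txt isn't reproduced here, the sketch below simulates stand-in columns; the names Income and Black are assumptions based on the classic cigarette consumption data, so substitute the actual column names from your file:

```r
# Simulated stand-in for cig_sales.txt (51 rows, six predictors).
# Income is built from HS on purpose, to create some collinearity.
set.seed(1)
n <- 51
cig <- data.frame(
  Female = rnorm(n, 51, 1),
  Age    = rnorm(n, 27, 2),
  HS     = rnorm(n, 53, 8),
  Black  = rnorm(n, 9, 8),
  Price  = rnorm(n, 38, 4)
)
cig$Income <- 40 * cig$HS + rnorm(n, 0, 100)

# VIF for Female: regress it on the other predictors (never on Sales),
# take R^2 from the summary, and apply VIF = 1 / (1 - R^2).
r2  <- summary(lm(Female ~ Age + HS + Income + Black + Price, data = cig))$r.squared
vif <- 1 / (1 - r2)
vif
```

Repeating the same two lines with each predictor on the left-hand side gives the other five VIFs.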
2.2 Draw conclusions
Do your results raise concerns about collinearity within this dataset? If so, what approach might you take to
resolve your concerns? (Hint: review Lecture 4 slides.)
3 Compare LDA, Logistic, QDA, and KNN (20 points)
Answer true or false for each statement and explain why.
A. LDA is almost never used when we have more than two response classes.
B. When classes are well separated, logistic regression often becomes unstable.
C. LDA often outperforms logistic regression when observations are drawn from a non-normal distribution.
D. QDA is often used as a compromise between the linear logistic regression and LDA approaches, on one
end, and the non-linear, non-parametric KNN approach, on the other end.
E. Suppose we have two Gaussian classes, Black and Blue, and two predictors, X1 and X2. The correlation
between X1 and X2 is 0.6 for the Black class. The correlation between X1 and X2 is 0.6 for the Blue
class. We expect our Bayes decision boundary to be linear.
F. KNN often works better than QDA when n (the number of observations) is high, but KNN often falters
when p (the number of predictors) is high.
G. QDA can accurately model a wider range of problems than linear methods like LDA and logistic
regression.
H. Although logistic regression is a linear method, it can sometimes mimic a non-linear method, and a
non-linear relationship between a predictor and a class, through transformation of predictors, e.g.,
including predictors like X2 and X3.
I. If we draw 40 observations from a t-distribution, we expect LDA to outperform logistic regression.
J. If the Bayes decision boundary is moderately non-linear, QDA often works better than logistic regression,
LDA, and KNN. But if the true decision boundary is complex and jagged, KNN often outperforms
other methods.
4 Classification on dog cancer data (25 points)
Download the file echogen.txt from Canvas. These data are from a study evaluating cancer in dogs and
can be used to answer the question, are dogs’ lymph nodes benign or malignant? The predictors reflect
several associated ultrasonography measurements. The study’s objective was to evaluate use of ultrasound
to characterize lymph nodes in dogs. The six variables in these data are Echogen, Flowdist, meanPI,
meanRI, Lyadpati, and diagtype. The variable diagtype is the target indicator, where 1 indicates
malignant and 0 indicates benign.
4.1 Logistic regression on training set
Divide these data into a training set (119 rows) and a testing set (58 rows). Perform a logistic regression
using the training set.
Compute numerical and graphical summaries of the data. What patterns, if any, are present? Which of the
predictors, if any, appear to be statistically significant?
Compute a confusion matrix and an overall number of correct predictions on the training data. What types of
errors are shown in the confusion matrix? (Again, refer to page 145 in ISL for help with confusion matrices.)
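The split-and-fit step can be sketched as below, using a simulated stand-in for echogen.txt (the column names follow the assignment; the values are random, so the fitted model is illustrative only):

```r
# Simulated stand-in for echogen.txt: 177 rows = 119 train + 58 test.
set.seed(42)
n <- 177
echogen <- data.frame(
  Echogen  = rnorm(n),
  Flowdist = rnorm(n),
  meanPI   = rnorm(n),
  meanRI   = rnorm(n),
  Lyadpati = rnorm(n),
  diagtype = rbinom(n, 1, 0.5)
)

train_idx <- sample(seq_len(n), 119)   # 119 training rows, 58 testing rows
train <- echogen[train_idx, ]
test  <- echogen[-train_idx, ]

# Logistic regression on the training set, all predictors.
fit  <- glm(diagtype ~ ., data = train, family = binomial)
pred <- as.integer(predict(fit, type = "response") > 0.5)
table(predicted = pred, actual = train$diagtype)  # training confusion matrix
```

With the real file you would start from read.table("echogen.txt", ...) instead of the simulated data frame.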
4.2 Logistic regression on testing set
Use your fitted logistic model to make predictions on the test data. Compute the accuracy and a confusion
matrix. Again, explain the confusion matrix. How do your results here compare with those on the training set?
4.3 Linear Discriminant Analysis (LDA)
Repeat your analysis using Linear Discriminant Analysis (LDA). That is, fit an LDA model to the training
data and make predictions on the test data.
4.4 Quadratic Discriminant Analysis (QDA)
Repeat your analysis using Quadratic Discriminant Analysis (QDA). That is, fit a QDA model to the training
data and make predictions on the test data.
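Both discriminant analyses can be sketched with the MASS package (shipped with R); the data frame is again a simulated stand-in for echogen.txt:

```r
library(MASS)  # provides lda() and qda()

# Simulated stand-in for echogen.txt, split 119 / 58 as in the assignment.
set.seed(42)
n <- 177
echogen <- data.frame(
  Echogen  = rnorm(n),
  Flowdist = rnorm(n),
  meanPI   = rnorm(n),
  meanRI   = rnorm(n),
  Lyadpati = rnorm(n),
  diagtype = rbinom(n, 1, 0.5)
)
train_idx <- sample(seq_len(n), 119)
train <- echogen[train_idx, ]
test  <- echogen[-train_idx, ]

# LDA: fit on train, predict on test, summarize with a confusion matrix.
lda_fit  <- lda(diagtype ~ ., data = train)
lda_pred <- predict(lda_fit, newdata = test)$class
table(predicted = lda_pred, actual = test$diagtype)
mean(lda_pred == test$diagtype)  # test-set accuracy

# QDA: identical workflow, quadratic boundary.
qda_fit  <- qda(diagtype ~ ., data = train)
qda_pred <- predict(qda_fit, newdata = test)$class
mean(qda_pred == test$diagtype)
```

On random noise the accuracies hover near 0.5; on the real data they are the quantities you are asked to compare.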
4.5 KNN, K = 1
Repeat your analysis using KNN where k=1.
4.6 KNN, K = 10
Repeat your analysis using KNN where k=10.
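Both KNN runs can be sketched with the class package (shipped with R). KNN is distance-based, so the predictors are standardized first; the data frame is the same simulated stand-in for echogen.txt:

```r
library(class)  # provides knn()

# Simulated stand-in for echogen.txt, split 119 / 58 as in the assignment.
set.seed(42)
n <- 177
echogen <- data.frame(
  Echogen  = rnorm(n),
  Flowdist = rnorm(n),
  meanPI   = rnorm(n),
  meanRI   = rnorm(n),
  Lyadpati = rnorm(n),
  diagtype = rbinom(n, 1, 0.5)
)
train_idx <- sample(seq_len(n), 119)

# Standardize the predictors (KNN is sensitive to scale), keep the response aside.
Xs <- scale(echogen[, names(echogen) != "diagtype"])
cl <- factor(echogen$diagtype[train_idx])

knn1  <- knn(Xs[train_idx, ], Xs[-train_idx, ], cl = cl, k = 1)
knn10 <- knn(Xs[train_idx, ], Xs[-train_idx, ], cl = cl, k = 10)

mean(knn1  == echogen$diagtype[-train_idx])  # test accuracy, k = 1
mean(knn10 == echogen$diagtype[-train_idx])  # test accuracy, k = 10
```

Comparing the two accuracies illustrates the bias-variance trade-off the K = 1 vs. K = 10 questions are driving at.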
4.7 Experiment
Using the above techniques, experiment with different combinations, transformations, and interactions of
predictors. How did your approaches compare with the above models? Which model worked best and why?
5 Entropy (15 points)
Entropy was introduced during lecture 2. Entropy (typically denoted S) will be used in creating decision
trees later in our course. Entropy is sometimes computed using the natural log,

S = −p(x) log(p(x))

and other times using log base 2 (log2() in R):

S = −p(x) log2(p(x))
5.1 Log base 2
Write a function in R for computing entropy using log2().
Generate at least 20 probability values between 0.025 and 1 using R’s seq() function and compute entropies
using your function. Plot these probabilities on an x-axis and the entropies on a y-axis. Compute the
maximum entropy among your probabilities and identify that point on your plot using color and text.
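A sketch of the log2 version (the function name entropy2 and the choice of 40 grid points are mine, not prescribed by the assignment):

```r
# Single-term entropy with log base 2, per the formula above.
entropy2 <- function(p) -p * log2(p)

p <- seq(0.025, 1, length.out = 40)  # 40 probabilities in [0.025, 1]
S <- entropy2(p)

plot(p, S, type = "b", xlab = "probability p", ylab = "entropy (bits)")

# Mark the grid point with maximum entropy in color, with a text label.
i <- which.max(S)
points(p[i], S[i], col = "red", pch = 19)
text(p[i], S[i], labels = sprintf("max: p = %.3f", p[i]), pos = 4, col = "red")
```

As a sanity check, entropy2(0.5) = −0.5·log2(0.5) = 0.5 exactly.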
5.2 Natural log
Write a function for entropy using the natural logarithm instead of log base 2. (The natural logarithm is the
default in R.) As before, plot your probabilities and entropies. Which probability values yield the most entropy
on each scale (log2 and natural log)?
What are the entropy values that relate to a probability of 0.5 on both plots? Why might you favor log2 over
the natural logarithm when computing entropy?
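The natural-log version differs from the log2 version only by the constant factor log(2), which is worth verifying numerically (function names are again mine):

```r
entropy2  <- function(p) -p * log2(p)
entropy_e <- function(p) -p * log(p)  # natural log, R's default

p <- seq(0.025, 1, length.out = 40)
plot(p, entropy_e(p), type = "b", xlab = "probability p", ylab = "entropy (nats)")

# Change of base: -p*ln(p) = (-p*log2(p)) * ln(2), so the curves differ
# only by a constant vertical scaling and peak at the same p.
entropy_e(0.5)  # 0.5 * log(2) nats
entropy2(0.5)   # 0.5 bits
all.equal(entropy_e(p), entropy2(p) * log(2))
```

This change-of-base identity is one way to frame the final question about preferring log2.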
 
