APANPS4335: Machine Learning
Due February 10 by 11:55pm
Directions: please submit your homework as two files — .Rmd and .pdf — on the Canvas class website.
Include short explanations of your code and results throughout.
1 Bias-Variance Tradeoff (15 points)
It is easy to obtain a method with extremely low bias but high variance (for instance, by drawing
a curve that passes through every single training observation) or a method with very low variance
but high bias (by fitting a horizontal line to the data).
— James, et al., An Introduction to Statistical Learning with Applications in R (Springer, 2013).
The challenge, according to James, “lies in finding a method for which both the variance and the squared bias
are low.” In this exercise you will identify different elements of the bias-variance tradeoff in the figure below.
Identify the 9 labels in the figure above.
For the questions below, explain why you selected ‘True’ or ‘False’.
1.2
True or False. The y-axis value where the green line (B.) crosses the dotted line (D.), added to the y-axis
value where the red line (C.) crosses the dotted line, exactly equals the y-axis value where the black line (A.)
crosses the dotted line.
True or False. The left side of the equation in Figure 1 (i.e., $E[(y - \hat{f}(x))^2]$) will never be negative.
True or False. In general, we expect $\mathrm{Var}(\hat{f}(x))$ in the equation to be smaller when we use more flexible
methods.
True or False. As we move from less flexible to more flexible methods, $[\mathrm{Bias}(\hat{f}(x))]^2$ will usually not increase.
True or False. More data will reduce the bias of an estimator.
True or False. More data will reduce the variance of an estimator.
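The equation in Figure 1 is not reproduced in this text. Assuming it is the standard decomposition of expected test error from ISL (equation 2.7), which matches the terms named in the questions above, it would read as follows, where $\mathrm{Var}(\epsilon)$ is the irreducible error:

$$
E\left[(y - \hat{f}(x))^2\right] = \mathrm{Var}(\hat{f}(x)) + [\mathrm{Bias}(\hat{f}(x))]^2 + \mathrm{Var}(\epsilon)
$$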
2 K Nearest Neighbor (20 points)
2.1 Euclidean distance (5 points)
Compute the Euclidean distance for all points with respect to U, the point we want to predict, in the plot
above. Here are the data used for the above plot,
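The data table for the plot is not reproduced in this text. As a minimal sketch of the distance computation only, the points, labels, and the location of U below are all made up for illustration and are not the assignment's data.

```r
# Hypothetical points (NOT the assignment's data): labels and coordinates are invented.
pts <- data.frame(
  label = c("A", "B", "C"),
  x = c(1, 4, 6),
  y = c(2, 6, 1)
)
u <- c(x = 3, y = 3)  # hypothetical location of the query point U

# Euclidean distance to U: sqrt((x - x_U)^2 + (y - y_U)^2)
pts$dist_to_u <- sqrt((pts$x - u[["x"]])^2 + (pts$y - u[["y"]])^2)

# Sort so that the nearest neighbors come first
pts[order(pts$dist_to_u), ]
```

With real data, only the `pts` and `u` definitions would change; the distance line stays the same.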
x HW2). Suppose the model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ for $i = 1, 2, 3, \ldots, n$
is used to model the relationship between the number of minutes required for a service call
and the number of machines serviced.
3.1
Estimate β0 and β1 using the least squares method. Interpret the estimate of β1.
3.2
Use a 95% confidence interval to estimate β1. Interpret your result in words, e.g., “We are 95% confident
that....”
3.3
Estimate the average time it will take to service 6 microcomputers using a 95% confidence interval. Interpret
your result in words, e.g., “We are 95% confident that on average.....”
3.4
Compute a 95% prediction interval for the amount of time it will take to service 6 microcomputers. Interpret
your result in words. [This is a prediction interval, not a confidence interval. See page 82 of ISL for the
difference between the two.]
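The workflow for 3.1 through 3.4 can be sketched in base R as below. This is only an illustration: the `calls` data frame and its column names `units` and `minutes` are hypothetical stand-ins, not the assignment's data set.

```r
# Hypothetical service-call data (NOT the assignment's data set).
calls <- data.frame(
  units   = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8),
  minutes = c(20, 33, 36, 50, 62, 80, 77, 95, 108, 125)
)

# Simple linear regression by least squares
fit <- lm(minutes ~ units, data = calls)
coef(fit)                            # estimates of beta0 (intercept) and beta1 (slope)
confint(fit, "units", level = 0.95)  # 95% confidence interval for the slope

# Intervals at 6 units serviced
new_call <- data.frame(units = 6)
ci_mean <- predict(fit, new_call, interval = "confidence", level = 0.95)  # mean response
pi_new  <- predict(fit, new_call, interval = "prediction", level = 0.95)  # a single new call
ci_mean
pi_new
```

Both intervals are centered on the same fitted value. The prediction interval is wider because it also accounts for the error of an individual observation.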
3.5
Hypothesis test. Is X associated with Y?
If $\beta_1 = 0$ then our model reduces to $Y = \beta_0 + \epsilon$, and X is not associated with Y. To test the null hypothesis,
we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident
that $\beta_1$ is non-zero (ISL, p. 67). Here we want to test:
$$
H_0 : \beta_1 = 0 \qquad H_a : \beta_1 \neq 0
$$
using $\alpha = 0.05$ and
$$
t = \frac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)}
$$
Show all steps.
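One way to lay out the steps in R is sketched below. The `calls` data frame is again a hypothetical stand-in, so the numbers it produces are not the assignment's answer.

```r
# Hypothetical service-call data (NOT the assignment's data set).
calls <- data.frame(
  units   = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8),
  minutes = c(20, 33, 36, 50, 62, 80, 77, 95, 108, 125)
)
fit <- lm(minutes ~ units, data = calls)

# Coefficient table columns: Estimate, Std. Error, t value, Pr(>|t|)
est   <- summary(fit)$coefficients
b1    <- est["units", "Estimate"]
se_b1 <- est["units", "Std. Error"]

# Test statistic t = (b1 - 0) / SE(b1), with n - 2 degrees of freedom
t_stat <- (b1 - 0) / se_b1
dof    <- df.residual(fit)

# Two-sided test at alpha = 0.05
alpha     <- 0.05
t_crit    <- qt(1 - alpha / 2, dof)
p_value   <- 2 * pt(abs(t_stat), dof, lower.tail = FALSE)
reject_h0 <- abs(t_stat) > t_crit
```

The hand-computed `t_stat` and `p_value` should agree with the `t value` and `Pr(>|t|)` columns that `summary(fit)` reports.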
4 Multiple Linear Regression (25 points)
4.1
How do demographic factors impact cigarette sales in the U.S.? Load cig_sales.txt, which you will find in
Canvas in the “Files” tab under HW2.
4.2
Which of the eight variables are qualitative? Which are quantitative?
4.3
Produce a scatterplot matrix using all quantitative variables. Do any of the variables appear to be well
correlated with Sales? Which two variables are most correlated with Sales?
4.4
Delete the State variable. Perform a multiple linear regression. Use Sales as your response variable and the
remaining variables as predictors. Produce a summary of your fit. Does there appear to be a relationship
between the predictors and target? Explain your answer with reference to summary statistics (e.g., $R^2$,
adjusted $R^2$, the F-statistic, p-values, et cetera). Which, if any, of the predictors are statistically significant?
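The summary statistics mentioned above can be pulled out of a fitted model as sketched below. Since cig_sales.txt is not available here, the sketch uses R's built-in `mtcars` data as a stand-in, and the choice of predictors is arbitrary.

```r
# Stand-in data: mtcars ships with R; it is NOT the cig_sales data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

s$r.squared       # R^2
s$adj.r.squared   # adjusted R^2

# Overall F-test: statistic plus numerator and denominator degrees of freedom
f <- s$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
f_p_value

# Per-predictor p-values from the coefficient table
s$coefficients[, "Pr(>|t|)"]
```

With the assignment data, a call along the lines of `lm(Sales ~ ., data = cig)` after dropping the State column would fit Sales on all remaining variables; `cig` is a hypothetical name for the loaded data frame.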
4.5
If we included a higher proportion of females in the study, would you expect cigarette sales to increase or
decline? Conversely, if we lowered the price of cigarettes, would you expect cigarette sales to increase or
decline? By how much?
4.6
Produce diagnostic plots of your model from the fitted data using plot(fit).
• Can you identify any potential outliers? Which observations?
• Look at the leverage plot. Would you consider eliminating any observations based on your Cook’s
distance results?
• Explain your justification for eliminating (or not eliminating) observations with reference to the Cook’s
distance computation (e.g., what threshold are you using?).
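Cook's distance can also be inspected numerically rather than only read off the plot. The sketch below uses the built-in `mtcars` data as a stand-in, and the 4/n cutoff is one common rule of thumb, an assumption here rather than a threshold the assignment prescribes.

```r
# Stand-in data: mtcars ships with R; it is NOT the cig_sales data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One Cook's distance per observation
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption; other thresholds such as 1 are also used)
threshold <- 4 / nrow(mtcars)
flagged <- cd[cd > threshold]
sort(flagged, decreasing = TRUE)
```

`plot(fit, which = 4)` draws the Cook's distance plot and `plot(fit, which = 5)` the residuals versus leverage plot for the same model.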
4.7
Experiment with interactions. Which of your interactions did best? Were you able to find any interactions
that were statistically significant?
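An interaction term can be added with the `*` or `:` formula operators and then compared against the additive model. The sketch below uses the built-in `mtcars` data as a stand-in; the variables chosen are arbitrary and only illustrate the mechanics.

```r
# Stand-in data: mtcars ships with R; it is NOT the cig_sales data.
base_fit  <- lm(mpg ~ wt + hp, data = mtcars)
inter_fit <- lm(mpg ~ wt * hp, data = mtcars)  # expands to wt + hp + wt:hp

# Estimate, standard error, t value and p-value of the interaction term
summary(inter_fit)$coefficients["wt:hp", ]

# Nested-model F-test: does the interaction improve on the additive fit?
anova(base_fit, inter_fit)

# Adjusted R^2 for both models, side by side
adj_r2 <- c(
  base  = summary(base_fit)$adj.r.squared,
  inter = summary(inter_fit)$adj.r.squared
)
adj_r2
```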
5 F statistic and R2 (20 points)
Consider a regression model with p predictors. That is:
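$$
Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \epsilon_i, \quad \text{where } i \in \{1, 2, \ldots, n\}
$$
Show that
$$
F = \frac{n - p - 1}{p} \cdot \frac{R^2}{1 - R^2}
$$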
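Before writing the derivation, the identity can be sanity-checked numerically. The sketch below is not a proof; it only compares the F-statistic that `summary()` reports with the value computed from $R^2$, using the built-in `mtcars` data and an arbitrary set of predictors as a stand-in.

```r
# Stand-in data: mtcars ships with R; any linear model with an intercept would do.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s <- summary(fit)

n  <- nrow(mtcars)          # number of observations
p  <- length(coef(fit)) - 1 # number of predictors (intercept excluded)
r2 <- s$r.squared

# F computed from R^2 using the identity to be shown
f_from_r2 <- (n - p - 1) / p * r2 / (1 - r2)

# F as reported by summary()
f_reported <- unname(s$fstatistic["value"])

all.equal(f_from_r2, f_reported)
```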
To include math notation in an R Markdown file, place LaTeX math notation outside code
chunks between two pairs of dollar signs, like so:
$$
\sum\limits_{i=1}^{10}{x_i}
$$
Or between a single pair to keep the notation within the sentence, like this: $\beta$, which shows β.
You can get LaTeX code easily from online math editors such as
http://visualmatheditor.equatheque.net/VisualMathEditor.html.