
Statistical learning
Department of Economics
Brock University
1 Assignment 2
1.1 Conceptual questions
1. Suppose that we wish to predict whether a given stock will issue a dividend this year
(“Yes” or “No”) based on X, last year’s percent profit. We examine a large number of
companies and discover that the mean value of X for companies that issued a dividend
was 10, while the mean for those that didn’t was 0. In addition, the variance of X for these
two sets of companies was 36. Finally, 80% of companies issued dividends. Assuming
that X follows a normal distribution, predict the probability that a company will issue a
dividend this year given that its percentage profit was X = 4 last year. Use equation (1)
from your notes on classification.
2. This problem has to do with odds. On average, what fraction of people with odds of
0.37 of defaulting on their credit card payment will in fact default?
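The two calculations above can be sketched as follows. This is a minimal illustration, assuming that "equation (1)" in the course notes is the Bayes posterior formula with normal class densities and a common variance (the handout does not reproduce it). The function names normal_pdf, posterior_yes and odds_to_prob are illustrative only, and Python is used here purely for the arithmetic.

```python
import math

def normal_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def posterior_yes(x, prior_yes, mean_yes, mean_no, variance):
    """P(Yes | X = x) by Bayes' theorem, assuming normal class densities
    with a common variance (assumed form of the notes' equation (1))."""
    num = prior_yes * normal_pdf(x, mean_yes, variance)
    den = num + (1 - prior_yes) * normal_pdf(x, mean_no, variance)
    return num / den

def odds_to_prob(odds):
    """Invert odds = p / (1 - p) to get p = odds / (1 + odds)."""
    return odds / (1 + odds)

# Values taken from question 1: means 10 and 0, variance 36, prior 0.8, X = 4.
p_dividend = posterior_yes(4, prior_yes=0.8, mean_yes=10, mean_no=0, variance=36)
# Value taken from the odds question.
p_default = odds_to_prob(0.37)

print(round(p_dividend, 3), round(p_default, 3))
```

Note that the normalising constant of the normal density cancels in the ratio because both classes share the same variance, so only the exponential terms and the priors drive the result.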
1.2 Classification methods
This question should be answered using the Weekly data set, which is part of the ISLR package.
This data is similar in nature to the Smarket data except that it contains 1,089 weekly returns
for 21 years, from the beginning of 1990 to the end of 2010.
1. Produce some numerical and graphical summaries of the Weekly data. Do there appear
to be any patterns? For the numerical summaries focus on the means of the returns
(today and all lags) as well as on the correlation between today’s returns and the lags.
For the graphical summaries create a plot of today’s return versus its first lag and discuss.
2. Use the full data set to perform a logistic regression with Direction as the response and
the five lag variables plus Volume as predictors. Use the summary function to print the
results. Do any of the predictors appear to be statistically significant? If so, which ones?
Compute the predicted probabilities and obtain the following features: min, max, mean.
Discuss those features.
3. Compute the confusion matrix and overall fraction of incorrect predictions. Explain
what the confusion matrix is telling you about the types of mistakes made by the logistic
regression.
4. Use the full data set to perform a linear probability model (LPM) regression with Direction as the response and
the five lag variables plus Volume as predictors. Use the summary function to print
the results. Do any of the predictors appear to be statistically significant? If so, which
ones? Compute the predicted probabilities and obtain the following features: min, max,
mean. Discuss those features. Are the LPM probabilities sensible? Are they similar to those of
the logistic regression? Do you expect the confusion matrix to be similar to that of the
logistic regression?
5. Compute the confusion matrix and overall fraction of incorrect predictions for this LPM.
Is the matrix similar to the one obtained with the logistic regression?
6. Now fit the logistic regression model using a training data period from 1990 to 2008, with
Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of
incorrect predictions for the held out data (that is, the data from 2009 and 2010).
7. Repeat (6) using LDA.
8. Repeat (6) using KNN with K = 1.
9. Which of these methods (logistic, LDA or KNN) appears to provide the best results on
this data? Why?
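The hold-out procedure in questions 6 and 8, together with the confusion matrix and error rate asked for throughout, can be sketched as below. This is only an illustration under stated assumptions: the rows are hypothetical numbers standing in for the Weekly data, the helper names knn1_predict and confusion_matrix are made up for this sketch, and the assignment itself expects the corresponding R functions rather than this Python code.

```python
from collections import Counter

# Hypothetical (year, lag2, direction) rows standing in for the Weekly data.
rows = [
    (2007, 1.2, "Up"), (2007, -0.8, "Down"), (2008, 0.5, "Up"),
    (2008, -1.5, "Down"), (2008, 2.1, "Up"), (2008, -0.3, "Up"),
    (2009, 0.9, "Up"), (2009, -1.1, "Up"), (2010, -0.6, "Down"),
    (2010, 1.8, "Up"),
]

# Training period up to 2008; 2009 and 2010 are held out, as in question 6.
train = [(x, y) for year, x, y in rows if year <= 2008]
test = [(x, y) for year, x, y in rows if year >= 2009]

def knn1_predict(x_new, training):
    """Return the label of the training point closest to x_new (K = 1)."""
    nearest = min(training, key=lambda pair: abs(pair[0] - x_new))
    return nearest[1]

def confusion_matrix(predicted, actual):
    """Count how often each (predicted, actual) label pair occurs."""
    return Counter(zip(predicted, actual))

predicted = [knn1_predict(x, train) for x, _ in test]
actual = [y for _, y in test]

cm = confusion_matrix(predicted, actual)
error_rate = sum(p != a for p, a in zip(predicted, actual)) / len(actual)

for pred in ("Down", "Up"):
    for act in ("Down", "Up"):
        print(f"predicted {pred}, actual {act}: {cm[(pred, act)]}")
print("fraction incorrect:", error_rate)
```

The off-diagonal cells (predicted Up when the market went Down, and the reverse) are the two types of mistake the questions ask you to discuss. The overall fraction incorrect is their sum divided by the number of held-out observations.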
1.3 Cross-validation
In this question you will use the glm() and predict() functions, and a for loop to compute the
LOOCV error for a simple logistic regression model on the Weekly data set.
1. Fit a logistic regression model that predicts Direction using Lag1 and Lag2.
2. Fit a logistic regression model that predicts Direction using Lag1 and Lag2 using all but
the first observation.
3. Use the model from (2) to predict the direction of the first observation. You can do this by
predicting that the first observation will go up if P(Direction = “Up” | Lag1, Lag2) > 0.5.
Was this observation correctly classified?
4. Write a for loop from i = 1 to i = n, where n is the number of observations in the data
set, that performs each of the following steps:
i. Fit a logistic regression model using all but the ith observation to predict Direction
using Lag1 and Lag2.
ii. Compute the posterior probability of the market moving up for the ith observation.
iii. Use the posterior probability for the ith observation in order to predict whether or
not the market moves up.
iv. Determine whether or not an error was made in predicting the direction for the ith
observation. If an error was made, then indicate this as a 1, and otherwise indicate
it as a 0.
5. Take the average of the n numbers obtained in (4)iv in order to obtain the LOOCV
estimate for the test error. Comment on the results.
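The loop described in steps 4 and 5 can be sketched as follows. This is a minimal illustration, not the expected solution: the data are hypothetical (Lag1, Lag2) pairs, and fit_logistic is a simple gradient-ascent stand-in for R's glm() with a binomial family, written only so that the example is self-contained.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_prob(beta, xi):
    """P(Up | predictors) under the fitted coefficients (intercept first)."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], xi)))

def fit_logistic(X, y, steps=2000, rate=0.1):
    """Fit intercept and slopes by gradient ascent on the mean log-likelihood.
    A simple stand-in for a proper logistic regression routine."""
    beta = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            resid = yi - predict_prob(beta, xi)
            grad[0] += resid
            for j, v in enumerate(xi):
                grad[j + 1] += resid * v
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical (Lag1, Lag2) pairs and directions (1 = Up, 0 = Down).
X = [(0.8, -0.3), (-1.2, 0.8), (0.4, 1.1), (-0.5, -0.9), (1.5, 0.2),
     (-0.7, 0.6), (0.1, -1.4), (-1.0, -0.2), (0.9, 0.7), (-0.2, 1.3)]
y = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

errors = []
for i in range(len(X)):
    X_train = X[:i] + X[i + 1:]            # all but the ith observation
    y_train = y[:i] + y[i + 1:]
    beta = fit_logistic(X_train, y_train)
    prob_up = predict_prob(beta, X[i])     # posterior probability of Up
    predicted = 1 if prob_up > 0.5 else 0
    errors.append(int(predicted != y[i]))  # 1 = misclassified, 0 = correct

loocv_error = sum(errors) / len(errors)
print("LOOCV error estimate:", loocv_error)
```

Each pass refits the model from scratch on n - 1 observations, so the cost grows with n model fits. That is why the exercise uses a simple model with two predictors.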
Notes:
• Have a look at the Course Outline (on Sakai) for more info on how to create tables.
• The report must be typed.
• The report should have a title page, be single-spaced, and be typed in a 12-point font.
• Your computer code and output should be included in the appendix.
• Pay attention to your graphs.
• Descriptive statistics, when applicable, should be reported in a table.
• Regression results should also be presented in a table. The first column of your table
should contain the list of independent variables (starting with the constant). The remaining
columns should contain the results for the different models. The last few rows of the
table should contain the sample size and two measures of goodness of fit.
• When using a test statistic, report the null hypothesis being tested, the formula for the test
statistic, and how it was computed (e.g., using a regression and, if so, which regression). Make sure
to report a conclusion for that test (e.g., I reject the null because XXXX and this implies
that XXXX).