
Big Data Methods
PC session 4
Part I: Regression trees
For this part of the exercise, use the data set “Carseats” from the package “ISLR”: a simulated data
set containing sales of child car seats at 400 different stores.
A data frame with 400 observations on the following 11 variables.
Sales: Unit sales (in thousands) at each location
CompPrice: Price charged by competitor at each location
Income: Community income level (in thousands of dollars)
Advertising: Local advertising budget for company at each location (in thousands of
dollars)
Population: Population size in region (in thousands)
Price: Price company charges for car seats at each site
ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the
shelving location for the car seats at each site
Age: Average age of the local population
Education: Education level at each location
Urban: A factor with levels No and Yes to indicate whether the store is in an urban or
rural location
US: A factor with levels No and Yes to indicate whether the store is in the US or not
Decision trees for prediction
1. Provide a histogram and summary statistics for the outcome variable “Sales”.
2. Define a training sample that contains 75% of the total sample. How many observations are
in the training sample?
3. Run a regression tree in the training data to predict sales using the default values of the
“tree” command. Create a plot of the tree indicating the number and percentage of
observations in each leaf. Interpret the results.
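A minimal sketch for exercises 1–3. The seed value and the object names (train, reg.tree) are my own choices, not prescribed by the exercise; the tree package's plot/text functions label leaves with predicted values, while observation counts per leaf can be read from summary().

```r
library(ISLR)   # provides the Carseats data set
library(tree)

data(Carseats)

# Exercise 1: histogram and summary statistics of the outcome
hist(Carseats$Sales, main = "Unit sales (in thousands)", xlab = "Sales")
summary(Carseats$Sales)

# Exercise 2: 75% training sample (seed chosen arbitrarily for reproducibility)
set.seed(1)
train <- sample(nrow(Carseats), 0.75 * nrow(Carseats))
length(train)   # number of observations in the training sample

# Exercise 3: regression tree with the default settings of the tree command
reg.tree <- tree(Sales ~ ., data = Carseats, subset = train)
plot(reg.tree)
text(reg.tree, pretty = 0)
summary(reg.tree)   # reports the number of terminal nodes and fit statistics
```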
4. Evaluate the performance of the tree in the test data based on the MSE.
5. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize
the MSE by setting FUN=prune.tree. Report the number of terminal nodes with the lowest
cross-validation criterion.
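A sketch for exercises 4–5, assuming objects named train and reg.tree as fitted in exercises 2–3 (these names are my assumption):

```r
library(ISLR)
library(tree)

# Exercise 4: test-set MSE of the unpruned tree
test <- Carseats[-train, ]
pred <- predict(reg.tree, newdata = test)
mean((test$Sales - pred)^2)

# Exercise 5: 10-fold cross-validation over tree sizes
set.seed(1)
cv.res <- cv.tree(reg.tree, FUN = prune.tree, K = 10)
# size with the lowest cross-validation deviance
best.size <- cv.res$size[which.min(cv.res$dev)]
best.size
```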
© 2019 Selina Gangl
6. Prune the tree in the training data to the optimal number of terminal nodes, plot the trained
tree, and evaluate the performance of the tree in the test data based on the MSE.
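A sketch for exercise 6, assuming reg.tree, train, and the cross-validated size best.size from the previous exercises (names are my assumption):

```r
library(ISLR)
library(tree)

# Prune to the optimal number of terminal nodes found by cross-validation
pruned <- prune.tree(reg.tree, best = best.size)
plot(pruned)
text(pruned, pretty = 0)

# Test-set MSE of the pruned tree
test <- Carseats[-train, ]
pred.pruned <- predict(pruned, newdata = test)
mean((test$Sales - pred.pruned)^2)
```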
7. Generate a dummy for Sales > 8, defined as a factor (or qualitative) variable, to be used as
the outcome variable in a classification tree. Merge the outcome with the rest of the data.
8. Run a classification tree in the training data using the newly created factor variable as
outcome and all remaining variables but Sales as predictors. Use k-fold cross-validation
(k=10) to determine the optimal number of splits and minimize the classification error by
setting FUN=prune.misclass.
9. Prune the tree in the training data to the optimal number of terminal nodes. Plot the trained
tree and evaluate the performance of the tree in the test data based on the classification
error rate.
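A sketch for exercises 7–9. The names High and CS for the dummy and the merged data frame are my own, as is the seed:

```r
library(ISLR)
library(tree)

# Exercise 7: factor dummy for Sales > 8, merged with the remaining variables
High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))
CS <- data.frame(Carseats, High)

# Exercise 8: classification tree and 10-fold CV on the misclassification error
class.tree <- tree(High ~ . - Sales, data = CS, subset = train)
set.seed(1)
cv.class <- cv.tree(class.tree, FUN = prune.misclass, K = 10)
best.k <- cv.class$size[which.min(cv.class$dev)]

# Exercise 9: prune, plot, and compute the test classification error rate
pruned.class <- prune.misclass(class.tree, best = best.k)
plot(pruned.class)
text(pruned.class, pretty = 0)
pred.class <- predict(pruned.class, CS[-train, ], type = "class")
mean(pred.class != CS$High[-train])
```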
Part II: Random forests for prediction
10. Apply bagging with regression trees using 500 trees in the training data. Set mtry to the total
number of predictors for bagging. Evaluate the performance of bagging in the test data
based on the MSE.
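A sketch for exercise 10 using the randomForest package. Setting mtry to the full number of predictors makes the random forest equivalent to bagging:

```r
library(ISLR)
library(randomForest)

p <- ncol(Carseats) - 1   # all columns except the outcome Sales
set.seed(1)
bag <- randomForest(Sales ~ ., data = Carseats, subset = train,
                    mtry = p, ntree = 500)

# Test-set MSE of the bagged trees
pred.bag <- predict(bag, newdata = Carseats[-train, ])
mean((Carseats$Sales[-train] - pred.bag)^2)
```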
11. Run a random forest with regression trees using 500 trees in the training data. Do not specify
mtry in this exercise. Report the most important predictors. Evaluate the performance of the
random forest in the test data based on the MSE. Use cross-validation to find the optimal
number of predictors per split.
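A sketch for exercise 11. With mtry left unspecified, randomForest uses its regression default of p/3 predictors per split. tuneRF searches over mtry using the out-of-bag error, which I use here as a stand-in for the cross-validation the exercise asks for:

```r
library(ISLR)
library(randomForest)

set.seed(1)
rf <- randomForest(Sales ~ ., data = Carseats, subset = train,
                   ntree = 500, importance = TRUE)

# Most important predictors
importance(rf)
varImpPlot(rf)

# Test-set MSE
pred.rf <- predict(rf, newdata = Carseats[-train, ])
mean((Carseats$Sales[-train] - pred.rf)^2)

# Search over the number of predictors per split (Sales is column 1)
set.seed(1)
tuneRF(Carseats[train, -1], Carseats$Sales[train], ntreeTry = 500)
```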
12. Re-run the random forest with 5 predictors per split (mtry = 5) and compute the test MSE.
13. Use k-fold cross-validation to assess model accuracy in the test data based on random forests
with regression trees. The number of folds (k) should be 5. Report the average of the MSEs
across folds to assess the overall performance.
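A sketch for exercise 13. The fold assignment and seed are my own choices:

```r
library(ISLR)
library(randomForest)

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Carseats)))

# Fit on k-1 folds, compute the MSE on the held-out fold, repeat for each fold
cv.mse <- sapply(1:k, function(i) {
  fit  <- randomForest(Sales ~ ., data = Carseats[folds != i, ], ntree = 500)
  pred <- predict(fit, newdata = Carseats[folds == i, ])
  mean((Carseats$Sales[folds == i] - pred)^2)
})

mean(cv.mse)   # average MSE across the 5 folds
```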
14. Run a random forest with classification trees (considering the binary outcome) using 500
trees in the training data. Evaluate the performance of the random forest in the test data
based on the classification error rate.
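A sketch for exercise 14, assuming a merged data frame CS with the factor outcome High as constructed in exercise 7 (these names are my assumption):

```r
library(ISLR)
library(randomForest)

set.seed(1)
rf.class <- randomForest(High ~ . - Sales, data = CS, subset = train,
                         ntree = 500)

# Test-set classification error rate
pred <- predict(rf.class, newdata = CS[-train, ])
mean(pred != CS$High[-train])
```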
Part III: Random forests for causal analysis
15. Use the dataset “HMDA” (in the AER package). Find a description of the dataset below:
Outcome variable (y): Denial of mortgage (deny)
Regressor whose causal effect is of interest (d): Payments to income ratio (pirat)
Control variables (x):
- Housing expense to income ratio (hirat)
- Loan to value ratio (lvrat)
- Credit history: consumer payments (chist)
- Credit history: mortgage payment (mhist)
- 1989 Massachusetts unemployment rate in applicant's industry (unemp)
- Is the individual self-employed? (selfemp)
- Was the individual denied mortgage insurance? (insurance)
- Is the unit a condominium? (condomin)
- Is the individual African-American? (afam)
- Is the individual single? (single)
- Does the individual have a high-school diploma? (hschool)
Edit the data for usage of the causal_forest command:
(i) Generate numerical zero/one values (rather than factors) for binary variables.
(ii) Assign variable names easy to work with (y = outcome, d = payments to income ratio, x1 =
first covariate etc.).
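A sketch of the data preparation in exercise 15. The helper to01 and the coding of the multi-level credit-history factors as numeric scores are my own choices:

```r
library(AER)   # provides the HMDA data set
data(HMDA)

# (i) numeric 0/1 dummies from the binary factors (second level, "yes", coded 1)
to01 <- function(f) as.numeric(f == levels(f)[2])

# (ii) easy-to-work-with names: y = outcome, d = treatment, x1, x2, ... = controls
y <- to01(HMDA$deny)
d <- HMDA$pirat                       # payments to income ratio
x <- data.frame(
  x1  = HMDA$hirat,                   # housing expense to income ratio
  x2  = HMDA$lvrat,                   # loan to value ratio
  x3  = as.numeric(HMDA$chist),       # credit history: consumer payments
  x4  = as.numeric(HMDA$mhist),       # credit history: mortgage payment
  x5  = HMDA$unemp,                   # unemployment rate in applicant's industry
  x6  = to01(HMDA$selfemp),
  x7  = to01(HMDA$insurance),
  x8  = to01(HMDA$condomin),
  x9  = to01(HMDA$afam),
  x10 = to01(HMDA$single),
  x11 = to01(HMDA$hschool)
)
```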
16. Train a model for estimating the causal effect of the payments to income ratio (= non-binary
treatment) using the command causal_forest.
17. Predict the conditional average causal effects of the payments to income ratio for each
observation in the test data.
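A sketch for exercises 16–17, assuming the objects y, d, and x prepared in exercise 15 (names are my assumption). Calling predict without newdata returns out-of-bag predictions for the observations used in training; pass the held-out covariates via newdata for a separate test sample:

```r
library(grf)

set.seed(1)
X  <- as.matrix(x)
cf <- causal_forest(X, y, d)   # causal_forest accepts a continuous treatment

# Conditional average causal effect predictions with variance estimates
tau.hat <- predict(cf, estimate.variance = TRUE)
head(tau.hat$predictions)
```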
18. Visualize the distribution of the effects by a histogram.
19. Compute the t-statistics and p-values of the conditional effects for each prediction.
20. Plot the conditional effects for different values of the payments to income ratio.
21. Provide the correlation of the conditional effects and the payments to income ratio.
22. Provide the average marginal effect and calculate the p-value.
23. Provide the average marginal effect among high school graduates and calculate the p-value.
24. Predict the conditional effect at average values of the control variables and calculate the
p-value.
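A sketch covering exercises 18–24, assuming the forest cf, the predictions tau.hat, the treatment d, the covariate matrix X, and the data frame x from the previous exercises (all names are my assumption):

```r
library(grf)

# Exercise 18: distribution of the conditional effects
hist(tau.hat$predictions)

# Exercise 19: t-statistics and two-sided p-values per prediction
t.stat <- tau.hat$predictions / sqrt(tau.hat$variance.estimates)
p.val  <- 2 * pnorm(-abs(t.stat))

# Exercise 20: conditional effects against the payments to income ratio
plot(d, tau.hat$predictions,
     xlab = "Payments to income ratio", ylab = "Conditional effect")

# Exercise 21: correlation of effects and treatment
cor(tau.hat$predictions, d)

# Exercise 22: average marginal effect with p-value
ate <- average_treatment_effect(cf)
2 * pnorm(-abs(ate["estimate"] / ate["std.err"]))

# Exercise 23: average marginal effect among high-school graduates
average_treatment_effect(cf, subset = x$x11 == 1)

# Exercise 24: conditional effect at the average of the control variables
predict(cf, newdata = matrix(colMeans(X), nrow = 1), estimate.variance = TRUE)
```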
