Big Data Methods

PC session 4

Part I: Regression trees

For this part of the exercise, use the data set “Carseats” from the package “ISLR”. It is a simulated data set containing sales of child car seats at 400 different stores, stored as a data frame with 400 observations on the following 11 variables.

Sales: Unit sales (in thousands) at each location

CompPrice: Price charged by competitor at each location

Income: Community income level (in thousands of dollars)

Advertising: Local advertising budget for company at each location (in thousands of dollars)

Population: Population size in region (in thousands)

Price: Price company charges for car seats at each site

ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site

Age: Average age of the local population

Education: Education level at each location

Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US: A factor with levels No and Yes to indicate whether the store is in the US or not

Decision trees for prediction

1. Provide a histogram and summary statistics for the outcome variable “Sales”.

2. Define a training sample that contains 75% of the total sample. How many observations are in the training sample?

3. Run a regression tree in the training data to predict sales using the default values of the “tree” command. Create a plot of the tree indicating the number and percentage of observations in each leaf. Interpret the results.

4. Evaluate the performance of the tree in the test data based on the MSE.

5. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the MSE by setting FUN=prune.tree. Report the number of terminal nodes with the lowest cross-validation criterion.

2019 Selina Gangl

6. Prune the tree in the training data to the optimal number of terminal nodes, plot the trained tree, and evaluate the performance of the tree in the test data based on the MSE.

7. Generate a dummy for Sales>8, defined as a factor (or qualitative) variable, to be used as the outcome variable in a classification tree. Merge the outcome with the rest of the data.

8. Run a classification tree in the training data using the newly created factor variable as outcome and all remaining variables except Sales as predictors. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the classification error by setting FUN=prune.misclass.

9. Prune the tree in the training data to the optimal number of terminal nodes. Plot the trained tree and evaluate the performance of the tree in the test data based on the classification error rate.
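The evaluation quantities asked for in steps 2, 4, 7 and 9 are plain arithmetic, whatever software fits the trees. The following is a minimal sketch in Python (standard library only) rather than a solution with the “tree” package: all function names, the seed and the toy numbers are assumptions made for illustration, and no tree is fitted here.

```python
import random


def train_test_split_indices(n, train_share=0.75, seed=1):
    """Split the row indices 0..n-1 at random into training and test sets.

    The seed is an arbitrary choice; it only makes the split reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(n * train_share)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def mean_squared_error(actual, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def high_sales_dummy(sales, threshold=8):
    """Label each observation "Yes" if Sales > threshold, otherwise "No"."""
    return ["Yes" if s > threshold else "No" for s in sales]


def classification_error_rate(actual, predicted):
    """Share of observations whose predicted class differs from the actual one."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# 75% of the 400 Carseats observations go to the training sample.
train_idx, test_idx = train_test_split_indices(400)
print(len(train_idx), len(test_idx))

# Toy (made-up) values, only to show how the functions are called.
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(high_sales_dummy([7.5, 8.0, 8.1]))
print(classification_error_rate(["Yes", "No", "No", "Yes"],
                                ["Yes", "Yes", "No", "No"]))
```

In the exercise itself, the predicted values would come from the fitted (and later pruned) tree evaluated on the test rows; the same two criteria then compare the unpruned and pruned trees.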

Part II: Random forests for prediction

10. Apply bagging with regression trees using 500 trees in the training data. Set mtry to the total number of predictors for bagging. Evaluate the performance of bagging in the test data based on the MSE.

11. Run a random forest with regression trees using 500 trees in the training data. Do not specify mtry in this exercise. Report the most important predictors. Evaluate the performance of the random forest in the test data based on the MSE. Use cross-validation to find the optimal number of predictors per split.

12. Re-run the analysis with 5 predictors and compute the MSE.

13. Use k-fold cross-validation to assess model accuracy in the test data based on random forests with regression trees. The number of folds (k) should be 5. Report the average of the MSEs across folds to assess the overall performance.

14. Run a random forest with classification trees (considering the binary outcome) using 500 trees in the training data. Evaluate the performance of the random forest in the test data based on the classification error rate.
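The fold logic behind step 13 can be sketched independently of the learner: assign every observation to one of k = 5 folds, fit on the other four folds, compute the MSE on the held-out fold, and average the five MSEs. The Python sketch below (standard library only) makes several assumptions for illustration: the function names are hypothetical, the data are made up, and `mean_predictor` is a deliberately trivial stand-in that is not a random forest. Any function with the same `fit_predict(x_train, y_train, x_new)` shape could be passed instead.

```python
import random


def k_fold_indices(n, k=5, seed=1):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mse(x, y, fit_predict, k=5, seed=1):
    """Return the average of the fold MSEs and the list of fold MSEs."""
    fold_mses = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        # Fit on the k-1 remaining folds, predict the held-out fold.
        preds = fit_predict([x[i] for i in train],
                            [y[i] for i in train],
                            [x[i] for i in held_out])
        mse = sum((y[i] - p) ** 2 for i, p in zip(held_out, preds)) / len(held_out)
        fold_mses.append(mse)
    return sum(fold_mses) / k, fold_mses


def mean_predictor(x_train, y_train, x_new):
    """Stand-in learner (NOT a random forest): predicts the training mean."""
    mean_y = sum(y_train) / len(y_train)
    return [mean_y for _ in x_new]


# Made-up data: 20 observations with a single predictor.
x = [[float(i)] for i in range(20)]
y = [3.0 + 0.5 * i for i in range(20)]
average_mse, per_fold = cross_validated_mse(x, y, mean_predictor, k=5)
print(per_fold)
print(average_mse)
```

Averaging over folds uses every observation exactly once for evaluation, which makes the reported MSE less dependent on one particular training/test split.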

Part III: Random forests for causal analysis

15. Use the dataset “HMDA” (in the AER package). Find a description of the dataset below:

Outcome variable (y): Denial of mortgage (deny)

Regressor whose causal effect is of interest (d): Payments to income ratio (pirat)

Control variables (x):

Housing expense to income ratio (hirat)

Loan to value ratio (lvrat)

Credit history: consumer payments (chist)

Credit history: mortgage payment (mhist)

Public bad credit record? (phist)

1989 Massachusetts unemployment rate in applicant's industry (unemp)

Is the individual self-employed? (selfemp)

Was the individual denied mortgage insurance? (insurance)

Is the unit a condominium? (condomin)

Is the individual African-American? (afam)

Is the individual single? (single)

Does the individual have a high-school diploma? (hschool)

Edit the data for usage of the causal_forest command:

(i) Generate numerical zero/one values (rather than factors) for binary variables.

(ii) Assign variable names easy to work with (y = outcome, d = payments to income ratio, x1 = first covariate etc.).

16. Train a model for estimating the causal effect of the payments to income ratio (= non-binary treatment) using the command causal_forest.

17. Predict the conditional average causal effects of the payments to income ratio for each observation in the test data.

18. Visualize the distribution of the effects by a histogram.

19. Compute the t-statistics and p-values of the conditional effects for each prediction.

20. Plot the conditional effects for different values of the payments to income ratio.

21. Provide the correlation of the conditional effects and the payments to income ratio.

22. Provide the average marginal effect and calculate the p-value.

23. Provide the average marginal effect among high school graduates and calculate the p-value.

24. Predict the conditional effect at average values of the control variables and calculate the p-value.
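The inference steps (19 and 22 to 24) and the correlation in step 21 reduce to a few formulas once a point estimate and a variance estimate are available for each effect. The sketch below, in Python with the standard library only, assumes a standard normal approximation for the t-statistic, so the two-sided p-value is 2(1 - Φ(|t|)) with t = estimate / sqrt(variance). The function names and all numbers are made up for illustration; nothing here is output from causal_forest.

```python
from math import sqrt
from statistics import NormalDist


def t_and_p(estimate, variance):
    """t-statistic and two-sided p-value under a standard normal approximation."""
    standard_error = sqrt(variance)
    t_stat = estimate / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_value


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


# Made-up conditional effects, their variance estimates, and treatment values.
effects = [0.50, 0.20, 0.80, 0.35]
variances = [0.0625, 0.0400, 0.0900, 0.0100]
treatment = [0.30, 0.25, 0.45, 0.33]

for effect, variance in zip(effects, variances):
    t_stat, p_value = t_and_p(effect, variance)
    print(round(t_stat, 3), round(p_value, 4))

print(pearson_correlation(effects, treatment))
```

The same `t_and_p` call applies to a single average effect and its standard error: pass the estimate and the squared standard error.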
