
Big Data Methods

PC session 4

Part I: Regression trees

For this part of the exercise, use the data set “Carseats” from the package “ISLR”, a simulated data set containing sales of child car seats at 400 different stores.

A data frame with 400 observations on the following 11 variables:

Sales: Unit sales (in thousands) at each location
CompPrice: Price charged by competitor at each location
Income: Community income level (in thousands of dollars)
Advertising: Local advertising budget for company at each location (in thousands of dollars)
Population: Population size in region (in thousands)
Price: Price company charges for car seats at each site
ShelveLoc: A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
Age: Average age of the local population
Education: Education level at each location
Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location
US: A factor with levels No and Yes to indicate whether the store is in the US or not

Decision trees for prediction

1. Provide a histogram and summary statistics for the outcome variable “Sales”.

2. Define a training sample that contains 75% of the total sample. How many observations are in the training sample?
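A minimal sketch of tasks 1-2 in base R; the object names (`train_id`, `train`, `test`) and the seed are illustrative choices, not prescribed by the exercise.

```r
library(ISLR)
data(Carseats)

# Task 1: histogram and summary statistics of the outcome
hist(Carseats$Sales, main = "Distribution of Sales",
     xlab = "Unit sales (thousands)")
summary(Carseats$Sales)

# Task 2: draw a 75% training sample (seed fixed for reproducibility)
set.seed(1)
train_id <- sample(nrow(Carseats), size = 0.75 * nrow(Carseats))
train <- Carseats[train_id, ]
test  <- Carseats[-train_id, ]
nrow(train)  # 300 observations (0.75 x 400)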

3. Run a regression tree in the training data to predict sales using the default values of the “tree” command. Create a plot of the tree indicating the number and percentage of observations in each leaf. Interpret the results.

4. Evaluate the performance of the tree in the test data based on the MSE.
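One way to sketch tasks 3-4, assuming the `train`/`test` split from task 2 is already in memory:

```r
library(tree)

# Task 3: regression tree with default settings
tree_fit <- tree(Sales ~ ., data = train)
plot(tree_fit)
text(tree_fit, pretty = 0)
summary(tree_fit)  # variables used and number of terminal nodes

# Task 4: test-set MSE
pred <- predict(tree_fit, newdata = test)
mean((test$Sales - pred)^2)
```

Note that the `tree` package's default plot does not label leaf counts; fitting the same model with `rpart` and plotting via `rpart.plot(..., extra = 101)` is one way to display the number and percentage of observations in each leaf.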

5. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the MSE by setting FUN=prune.tree. Report the number of terminal nodes with the lowest cross-validation criterion.
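A sketch of the cross-validation step, assuming `tree_fit` from task 3:

```r
# 10-fold cross-validation over tree size
set.seed(1)
cv_fit <- cv.tree(tree_fit, FUN = prune.tree, K = 10)
cv_fit  # $size and $dev: deviance (RSS) per number of terminal nodes

# Terminal nodes with the lowest cross-validation criterion
best_size <- cv_fit$size[which.min(cv_fit$dev)]
best_size
```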

2019 Selina Gangl

6. Prune the tree in the training data to the optimal number of terminal nodes, plot the trained tree, and evaluate the performance of the tree in the test data based on the MSE.
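A sketch of the pruning step, assuming `best_size` from the cross-validation in task 5:

```r
# Prune to the cross-validated size and plot
pruned <- prune.tree(tree_fit, best = best_size)
plot(pruned)
text(pruned, pretty = 0)

# Test MSE of the pruned tree
pred_pruned <- predict(pruned, newdata = test)
mean((test$Sales - pred_pruned)^2)
```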

7. Generate a dummy for Sales>8 which is defined as a factor (or qualitative) variable and is to be used as the outcome variable in a classification tree. Merge the outcome with the rest of the data.
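A minimal sketch of the recoding; the names `High` and `Carseats2` are illustrative.

```r
# Binary outcome as a factor, merged back with the predictors
High <- factor(ifelse(Carseats$Sales > 8, "Yes", "No"))
Carseats2 <- data.frame(Carseats, High)
table(Carseats2$High)
```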

8. Run a classification tree in the training data using the newly created factor variable as outcome and all remaining variables but Sales as predictors. Use k-fold cross-validation (k=10) to determine the optimal number of splits and minimize the classification error by setting FUN=prune.misclass.

9. Prune the tree in the training data to the optimal number of terminal nodes. Plot the trained tree and evaluate the performance of the tree in the test data based on the classification error rate.
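Tasks 8-9 can be sketched as follows, reusing the training row indices from task 2 and the merged data from task 7:

```r
train2 <- Carseats2[train_id, ]
test2  <- Carseats2[-train_id, ]

# Task 8: classification tree (exclude Sales, which defines the outcome)
class_fit <- tree(High ~ . - Sales, data = train2)
set.seed(1)
cv_class <- cv.tree(class_fit, FUN = prune.misclass, K = 10)
best_k <- cv_class$size[which.min(cv_class$dev)]  # $dev = misclassifications here

# Task 9: prune, plot, and compute the test classification error rate
pruned_class <- prune.misclass(class_fit, best = best_k)
plot(pruned_class)
text(pruned_class, pretty = 0)
pred_class <- predict(pruned_class, newdata = test2, type = "class")
mean(pred_class != test2$High)  # classification error rate
```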

Part II: Random forests for prediction

10. Apply bagging with regression trees using 500 trees in the training data. Set mtry to the total number of predictors for bagging. Evaluate the performance of bagging in the test data based on the MSE.
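A sketch using the `randomForest` package: bagging is the special case of a random forest in which `mtry` equals the number of predictors.

```r
library(randomForest)

p <- ncol(train) - 1  # all variables except the outcome Sales
set.seed(1)
bag_fit <- randomForest(Sales ~ ., data = train, mtry = p, ntree = 500)

pred_bag <- predict(bag_fit, newdata = test)
mean((test$Sales - pred_bag)^2)  # test MSE of bagging
```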

11. Run a random forest with regression trees using 500 trees in the training data. Do not specify mtry in this exercise. Report the most important predictors. Evaluate the performance of the random forest in the test data based on the MSE. Use cross-validation to find the optimal number of predictors per split.
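A sketch for task 11; with `mtry` unspecified, `randomForest` defaults to p/3 for regression. The choice of 10 folds in `rfcv` is illustrative, since the task does not fix it.

```r
# Random forest with default mtry and variable importance
set.seed(1)
rf_fit <- randomForest(Sales ~ ., data = train, ntree = 500, importance = TRUE)
importance(rf_fit)  # %IncMSE and IncNodePurity per predictor
varImpPlot(rf_fit)  # most important predictors

# Test MSE
pred_rf <- predict(rf_fit, newdata = test)
mean((test$Sales - pred_rf)^2)

# Cross-validation over the number of predictors per split
set.seed(1)
cv_rf <- rfcv(trainx = train[, names(train) != "Sales"],
              trainy = train$Sales, cv.fold = 10)
cv_rf$error.cv  # CV error for each number of predictors tried
```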

12. Re-run the analysis with 5 predictors and compute the MSE.

13. Use k-fold cross-validation to assess model accuracy in the test data based on random forests with regression trees. The number of folds (k) should be 5. Report the average of the MSEs across folds to assess the overall performance.
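Tasks 12-13 can be sketched as below; assigning folds with `sample()` is one common choice.

```r
# Task 12: random forest with 5 predictors per split
set.seed(1)
rf5 <- randomForest(Sales ~ ., data = train, mtry = 5, ntree = 500)
mean((test$Sales - predict(rf5, newdata = test))^2)

# Task 13: 5-fold cross-validation, averaging the per-fold MSEs
k <- 5
set.seed(1)
folds <- sample(rep(1:k, length.out = nrow(Carseats)))
mse <- numeric(k)
for (i in 1:k) {
  fit <- randomForest(Sales ~ ., data = Carseats[folds != i, ], ntree = 500)
  pred <- predict(fit, newdata = Carseats[folds == i, ])
  mse[i] <- mean((Carseats$Sales[folds == i] - pred)^2)
}
mean(mse)  # average MSE across folds
```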

14. Run a random forest with classification trees (considering the binary outcome) using 500 trees in the training data. Evaluate the performance of the random forest in the test data based on the classification error rate.
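A sketch of task 14, assuming the factor outcome `High` and the `train2`/`test2` splits from Part I:

```r
# Classification random forest on the binary outcome
set.seed(1)
rf_class <- randomForest(High ~ . - Sales, data = train2, ntree = 500)

pred <- predict(rf_class, newdata = test2)
mean(pred != test2$High)  # test classification error rate
```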


Part III: Random forests for causal analysis

15. Use the dataset “HMDA” (in the AER package). Find a description of the dataset below:

Outcome variable (y): Denial of mortgage (deny)

Regressor whose causal effect is of interest (d): Payments to income ratio (pirat)

Control variables (x):
Housing expense to income ratio (hirat)
Loan to value ratio (lvrat)
Credit history: consumer payments (chist)
Credit history: mortgage payment (mhist)
Public bad credit record? (phist)
1989 Massachusetts unemployment rate in applicant's industry (unemp)
Is the individual self-employed? (selfemp)
Was the individual denied mortgage insurance? (insurance)
Is the unit a condominium? (condomin)
Is the individual African-American? (afam)
Is the individual single? (single)
Does the individual have a high-school diploma? (hschool)

Edit the data for usage of the causal_forest command:

(i) Generate numerical zero/one values (rather than factors) for binary variables.

(ii) Assign variable names easy to work with (y = outcome, d = payments to income ratio, x1 = first covariate, etc.).
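A sketch of the recoding in (i)-(ii); the x1, x2, ... names follow the convention suggested above, and the mapping of the ordered factors chist and mhist to their level indices via `as.numeric` is one possible choice.

```r
library(AER)
data(HMDA)

to01 <- function(f) as.numeric(f == "yes")  # yes/no factor -> 0/1 dummy

y <- to01(HMDA$deny)   # outcome
d <- HMDA$pirat        # treatment of interest
x <- data.frame(       # controls
  x1  = HMDA$hirat,
  x2  = HMDA$lvrat,
  x3  = as.numeric(HMDA$chist),
  x4  = as.numeric(HMDA$mhist),
  x5  = to01(HMDA$phist),
  x6  = HMDA$unemp,
  x7  = to01(HMDA$selfemp),
  x8  = to01(HMDA$insurance),
  x9  = to01(HMDA$condomin),
  x10 = to01(HMDA$afam),
  x11 = to01(HMDA$single),
  x12 = to01(HMDA$hschool)
)
```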

16. Train a model for estimating the causal effect of the payments to income ratio (= non-binary treatment) using the command causal_forest.

17. Predict the conditional average causal effects of the payments to income ratio for each observation in the test data.
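Tasks 16-17 can be sketched with the `grf` package, assuming `y`, `d`, and the covariate frame `x` from task 15; the 75% train/test split mirrors Part I and is a choice, not something the exercise fixes.

```r
library(grf)

set.seed(1)
tr <- sample(length(y), size = 0.75 * length(y))

# Task 16: causal forest with continuous treatment W = pirat
cf <- causal_forest(X = as.matrix(x[tr, ]), Y = y[tr], W = d[tr])

# Task 17: conditional effects for each test observation
cate <- predict(cf, newdata = as.matrix(x[-tr, ]), estimate.variance = TRUE)
head(cate$predictions)
```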

18. Visualize the distribution of the effects by a histogram.

19. Compute the t-statistics and p-values of the conditional effects for each prediction.

20. Plot the conditional effects for different values of the payments to income ratio.

21. Provide the correlation of the conditional effects and the payments to income ratio.

22. Provide the average marginal effect and calculate the p-value.

23. Provide the average marginal effect among high school graduates and calculate the p-value.

24. Predict the conditional effect at average values of the control variables and calculate the p-value.
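Tasks 18-24 can be sketched as follows, continuing from the objects above. With a continuous treatment, `average_treatment_effect` in grf returns an average partial (marginal) effect; the normal-approximation p-values are one standard choice.

```r
hist(cate$predictions, main = "Conditional effects of pirat")   # task 18

tstat <- cate$predictions / sqrt(cate$variance.estimates)       # task 19
pval  <- 2 * pnorm(-abs(tstat))

plot(d[-tr], cate$predictions,                                  # task 20
     xlab = "pirat", ylab = "Conditional effect")
cor(cate$predictions, d[-tr])                                   # task 21

ame <- average_treatment_effect(cf)                             # task 22
2 * pnorm(-abs(ame["estimate"] / ame["std.err"]))               # p-value

# Task 23: average effect among high-school graduates (x12 = hschool dummy)
average_treatment_effect(cf, subset = x$x12[tr] == 1)

# Task 24: conditional effect at the average values of the controls
xbar <- matrix(colMeans(x), nrow = 1)
predict(cf, newdata = xbar, estimate.variance = TRUE)
```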
