Big Data Methods
PC session 3
Part I: Empirical Part
Use the dataset “Oilfinance” for exercises 1-21.
Ridge and lasso regression for prediction of a continuous outcome
1) Define the first variable (i.e. the first column) in the data matrix to be the outcome y (price
change of the RTS index in % compared to one week ago) and the remaining variables to be the
predictors x (lagged levels and price changes of oil supply, stocks, and indices). Show the
distribution of y by means of a histogram.
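A minimal R sketch for exercise 1; the file name "Oilfinance.csv" and the object name oilfinance are assumptions and should be adjusted to how the data are actually provided:

    # Load the data (assumed to be a CSV file; adjust path/format as needed)
    oilfinance <- read.csv("Oilfinance.csv")
    y <- oilfinance[, 1]              # outcome: RTS index price change in %
    x <- as.matrix(oilfinance[, -1])  # predictors: lagged levels and price changes
    hist(y, main = "Distribution of y", xlab = "RTS index price change in %")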
2) Define a training sample containing 188 observations. Apply k-fold cross-validation (k=10) in
the training data to find the optimal lambda for ridge regression (alpha=0). Report the optimal
lambda for the ridge regression.
3) Next, run a ridge regression (alpha=0) in the training data with the optimal lambda
and show the coefficients.
4) Predict the outcome in the test data using the optimal lambda.
5) Compute the mean squared error and the mean absolute error, and also compute the average of
the absolute values of y in the test data (to compare it to the errors).
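A sketch for exercises 2-5 using the glmnet package (whose alpha argument matches the exercise wording); taking the first 188 rows as the training sample and the choice of random seed are assumptions:

    library(glmnet)
    set.seed(1)                       # fold assignment in cv.glmnet is random
    train <- 1:188                    # training sample: first 188 observations (assumption)
    test  <- setdiff(seq_len(nrow(x)), train)

    # 2) 10-fold cross-validation for ridge (alpha = 0)
    cv.ridge <- cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 10)
    cv.ridge$lambda.min               # optimal lambda

    # 3) ridge regression at the optimal lambda
    ridge.fit <- glmnet(x[train, ], y[train], alpha = 0, lambda = cv.ridge$lambda.min)
    coef(ridge.fit)

    # 4) prediction in the test data
    pred <- predict(ridge.fit, newx = x[test, ])

    # 5) error measures and the benchmark average |y|
    mean((y[test] - pred)^2)          # mean squared error
    mean(abs(y[test] - pred))         # mean absolute error
    mean(abs(y[test]))                # average absolute y in the test data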
6) Run a ridge regression (alpha=0) in the training data with a user-provided penalty of
lambda=10 and show the coefficients.
7) Predict the outcome in the test data and compute the mean squared error.
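A sketch for exercises 6 and 7, reusing the split from above:

    # 6) ridge regression with a user-provided penalty of lambda = 10
    ridge10 <- glmnet(x[train, ], y[train], alpha = 0, lambda = 10)
    coef(ridge10)

    # 7) prediction and mean squared error in the test data
    pred10 <- predict(ridge10, newx = x[test, ])
    mean((y[test] - pred10)^2)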
8) Run a lasso regression (alpha=1) with the same data. Apply k-fold cross-validation (k=10) in
the training data to find the optimal lambda. Report the coefficients that are different from
zero.
9) Predict the outcome in the test data using the optimal lambda.
10) Compute the mean squared error and the mean absolute error.
11) Predict the expected price change in % for mean values of x in the data.
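A sketch for exercises 8-11; lasso corresponds to alpha = 1 in glmnet:

    # 8) 10-fold cross-validation for lasso and the non-zero coefficients
    cv.lasso  <- cv.glmnet(x[train, ], y[train], alpha = 1, nfolds = 10)
    lasso.fit <- glmnet(x[train, ], y[train], alpha = 1, lambda = cv.lasso$lambda.min)
    b <- coef(lasso.fit)
    b[b[, 1] != 0, , drop = FALSE]    # coefficients different from zero

    # 9) + 10) prediction and error measures in the test data
    pred.lasso <- predict(lasso.fit, newx = x[test, ])
    mean((y[test] - pred.lasso)^2)    # mean squared error
    mean(abs(y[test] - pred.lasso))   # mean absolute error

    # 11) expected price change in % at the mean values of x
    predict(lasso.fit, newx = matrix(colMeans(x), nrow = 1))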
Ridge and lasso regression for prediction of a binary outcome
12) Create a binary variable for y>0 (meaning that the RTS index price change is larger than zero).
13) Run a lasso logit regression for binary outcomes (setting family to binomial). Apply k-fold
cross-validation (k=10) in the training data to find the optimal lambda. Report the optimal
lambda.
14) Run a lasso regression (alpha=1) in the training data with the optimal lambda. Report the
coefficients that are different from zero.
15) Predict the outcome in the test data using the optimal lambda.
16) Recode the predicted outcome to be one if the predicted probability is larger than 50%
(=0.5). Compare this variable to the true outcomes in the test data in order to calculate the
classification error rate and the share of correct classifications.
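A sketch for exercises 12-16; the family argument matches the exercise wording and the 0.5 threshold follows the exercise text:

    # 12) binary outcome: 1 if the RTS index price change is positive
    ybin <- as.numeric(y > 0)

    # 13) cross-validated lasso logit (family = "binomial")
    cv.logit <- cv.glmnet(x[train, ], ybin[train], alpha = 1,
                          family = "binomial", nfolds = 10)
    cv.logit$lambda.min               # optimal lambda

    # 14) lasso logit at the optimal lambda and its non-zero coefficients
    logit.fit <- glmnet(x[train, ], ybin[train], alpha = 1,
                        family = "binomial", lambda = cv.logit$lambda.min)
    b <- coef(logit.fit)
    b[b[, 1] != 0, , drop = FALSE]

    # 15) predicted probabilities in the test data
    phat <- predict(logit.fit, newx = x[test, ], type = "response")

    # 16) classify at the 0.5 threshold and compare to the true outcomes
    yhat <- as.numeric(phat > 0.5)
    mean(yhat != ybin[test])          # classification error rate
    mean(yhat == ybin[test])          # share of correct classifications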
Causal inference for one regressor based on double lasso without sample splitting
17) Define "brentyl1" (the price per barrel of Brent crude oil in the last period, i.e. one week ago) to be
the regressor d whose causal effect on y is of interest. Define the remaining regressors to be
used as potential controls x when estimating the effect of d on y.
18) Run a LASSO with double selection of x in the treatment and outcome equations to estimate
the causal effect of d on y. The effect of d is assumed to be homogeneous (it does not depend
on the values of x or d). Report the output.
19) Re-run the command with "partialling out" rather than "double selection". Report the
output.
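A sketch for exercises 17-19 using rlassoEffect from the hdm package; it assumes the predictor matrix x contains a column named brentyl1:

    library(hdm)

    # 17) causal regressor d and the remaining regressors as potential controls
    d  <- x[, "brentyl1"]
    xc <- x[, colnames(x) != "brentyl1"]

    # 18) double selection
    ds <- rlassoEffect(x = xc, y = y, d = d, method = "double selection")
    summary(ds)

    # 19) partialling out
    po <- rlassoEffect(x = xc, y = y, d = d, method = "partialling out")
    summary(po)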
Causal inference for one regressor based on double lasso with sample splitting
20) Apply the partialling-out method with sample splitting. Use the training sample to estimate a
lasso-based model for y as a function of x and for d as a function of x based on cross-validation.
Then estimate the effect of d on y in the test data. Swap the roles of the test and training
data and estimate the effect of d on y as the average of the effects in the two subsamples.
Furthermore, compute the standard error of the estimated effect.
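One possible sketch for exercise 20, reusing xc, d, and the training/test split from above. Treating the two split estimates as independent when combining their standard errors is a simplifying assumption:

    # Fit cross-validated lassos for y on x and d on x in one subsample,
    # form residuals in the other, and regress residual on residual.
    po.split <- function(i1, i2) {
      fy <- cv.glmnet(xc[i1, ], y[i1], alpha = 1)
      fd <- cv.glmnet(xc[i1, ], d[i1], alpha = 1)
      ry <- as.numeric(y[i2] - predict(fy, newx = xc[i2, ], s = "lambda.min"))
      rd <- as.numeric(d[i2] - predict(fd, newx = xc[i2, ], s = "lambda.min"))
      lm(ry ~ rd)
    }
    fit1 <- po.split(train, test)     # estimate in the test data
    fit2 <- po.split(test, train)     # swap the roles of the subsamples
    theta <- mean(c(coef(fit1)["rd"], coef(fit2)["rd"]))   # average effect
    se <- sqrt(summary(fit1)$coef["rd", 2]^2 +
               summary(fit2)$coef["rd", 2]^2) / 2          # SE of the average
    c(effect = theta, std.error = se)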
Causal inference for several regressors based on double lasso without sample splitting
21) Use the command “rlassoEffects” to perform causal inference for several regressors
based on double lasso without sample splitting.
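A sketch for exercise 21; which regressors to analyse jointly via the index argument is an assumption (here the first three columns of x):

    # Joint double-lasso inference on several regressors
    eff <- rlassoEffects(x = x, y = y, index = c(1, 2, 3),
                         method = "double selection")
    summary(eff)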
Lasso-based causal inference with instruments without sample splitting
Use the data “EminentDomain” from the “hdm” package. This is a dataset on judicial eminent
domain decisions and contains four sub-datasets, which differ mainly in their dependent
variables. Use the data about the non-metro (NM) area (in logs).
Outcome variable (y): log house price in the non-metro area of a circuit (= district)
Causal variable (d): number of pro-plaintiff appellate takings decisions overturning the government's seizure of property in favor of the private owner (indicator for the protection of individual property rights)
Instruments (z): characteristics of randomly assigned judges, including gender, race, religion, political affiliation, ...
Control variables (x):
Define the outcome variable (y), the causal variable (d), the instruments (z), and the control
variables (x).
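A sketch defining the four variable groups, assuming the non-metro sub-dataset is stored in the logNM element of EminentDomain:

    library(hdm)
    data(EminentDomain)
    y <- EminentDomain$logNM$y   # log house prices, non-metro area
    d <- EminentDomain$logNM$d   # pro-plaintiff appellate takings decisions
    z <- EminentDomain$logNM$z   # characteristics of randomly assigned judges
    x <- EminentDomain$logNM$x   # control variables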
22) Run LASSO IV estimation for the selection of controls x and instruments z.
23) Run LASSO IV estimation for the selection of z, but take all x variables as controls.
24) Run LASSO IV estimation for the selection of x, while using all of the first 20 elements in z as
instruments.
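A sketch for exercises 22-24 using rlassoIV from hdm; the select.X and select.Z switches control which variable groups lasso selects from:

    # 22) select both controls x and instruments z
    iv.both <- rlassoIV(x = x, d = d, y = y, z = z, select.X = TRUE, select.Z = TRUE)
    summary(iv.both)

    # 23) select only instruments z, keep all x as controls
    iv.z <- rlassoIV(x = x, d = d, y = y, z = z, select.X = FALSE, select.Z = TRUE)
    summary(iv.z)

    # 24) select only controls x, use the first 20 instruments
    iv.x <- rlassoIV(x = x, d = d, y = y, z = z[, 1:20],
                     select.X = TRUE, select.Z = FALSE)
    summary(iv.x)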
Part II: Conceptual questions
We will probably not have time to discuss this part in the PC session.
You may nevertheless use this part as a mock exam.
25) Compare Lasso estimation and standard OLS and comment on similarities and differences.
26) Compare ridge regression and Lasso estimation and comment on similarities and differences.
27) Explain the concept of k-fold cross-validation for picking the shrinkage factor in Lasso.
28) Explain the concept of post-Lasso double selection in OLS for performing causal inference.
29) Explain the idea of the adaptive Lasso. Why might it be preferred over
“conventional” Lasso?
30) Explain the concept of a “sparse” model.
31) What is the advantage of shrinkage methods compared to “classical” variable selection
methods like forward selection or backward elimination?
