
FTEC5580 Project 1

Due by 11:59pm, March 4, 2024

Instructions:

• Prepare a single Jupyter notebook and submit it in Blackboard. The notebook should contain your code, results, explanations, and interpretations of results. In particular, don’t forget the interpretations. Only showing the numerical output is insufficient and would result in a significant loss of points.

• You may submit your work at most twice in Blackboard. We will NOT accept further submissions after the second one, regardless of when you submit.

• Late submission incurs a penalty: 10% for submission on the first day after the deadline, 20% for submission on the second day after the deadline. Submissions made on the third day after the deadline and thereafter are NOT accepted.

• Name your Jupyter notebook as “last name-first name-P1”, e.g., Li-Lingfei-P1. Please follow the naming convention strictly.

• You must work on the project independently.

• You can only use Python in Jupyter for this project.

• The TA responsible for grading this project is WANG Boyu.

In this project, we are interested in predicting monthly returns of stocks using linear regression and tree-based models. Consider the following formulation:

$$R_{i,t+1} = f(X_{i,t}; \beta) + \epsilon_{i,t+1},$$

where

• $R_{i,t+1}$ is the net return of the $i$-th stock in month $t+1$.

• $X_{i,t}$ is the vector of covariates of the $i$-th stock in month $t$.

• $\epsilon_{i,t+1}$ is the error for the $i$-th stock in month $t+1$. The errors of different stocks and in different months are assumed to be i.i.d.

• $\beta$ is the vector of parameters in the model. It is important to note that $\beta$ is assumed to be the same for all stocks. Therefore, one needs to pool observations of different stocks in different months together to estimate $\beta$.

• Compared with the single-factor and Fama-French three-factor models, here we use the values of the covariates observed in month $t$ to predict the return for month $t+1$, whereas these factor models use the returns of the factors in month $t+1$. In reality, we can only use the information available at the current time to predict the future.
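The timing convention above (covariates from month t, return from month t+1, all stocks pooled) can be sketched in pandas as follows. This is only an illustration on made-up numbers: the tickers `AAA` and `BBB` and the column names `Month` and `RET` are assumptions, and the project data file already ships with the aligned response.

```python
import pandas as pd

# Toy panel: 2 hypothetical stocks, 3 months each (made-up numbers).
panel = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Month": ["2010-01", "2010-02", "2010-03"] * 2,
    "RET": [0.01, 0.02, -0.01, 0.03, 0.00, 0.04],
})

# Sort so that shifting within each stock follows calendar order.
panel = panel.sort_values(["Ticker", "Month"]).reset_index(drop=True)

# The response attached to month t is the same stock's return in month t+1.
panel["RETN"] = panel.groupby("Ticker")["RET"].shift(-1)

# The last month of each stock has no next-month return, so drop it.
# What remains is one pooled table across all stocks and months.
pooled = panel.dropna(subset=["RETN"]).reset_index(drop=True)
print(pooled)
```

Shifting within each ticker, rather than over the whole table, keeps one stock's return from leaking into another stock's response at the boundary between tickers.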

1 Data

Data is collected from CRSP, CompuStat, and WRDS Beta Suite. All three databases can be accessed through Wharton Research Data Services (WRDS). The data contains the following types of information:

• Ticker: the tickers for 30 stocks (current constituents of the DJIA index with DOW replaced by C).

• Month: From December 2009 till November 2019. We use covariates of the current month to predict the return of the next month.

• Covariates: There are 49 covariates in total, including the current-month return and measures constructed from the price data; the beta, alpha, idiosyncratic volatility, and total volatility of a stock; and financial ratios. You should read “explanation of some variables.xlsx” and “WRDS Industry Financial Ratio Manual.pdf” for explanations.

• Response variable: RETN, which stands for return of the next month.

For each stock, there are 120 observations and thus the number of observations of all 30 stocks combined is 3600.

The initial data downloaded from the databases contain missing values. To deal with them, we follow a common practice to use the average value of the covariates for all other companies in the same month as a replacement of the missing values. That is, if covariate Xj of Stock i is missing in Jan 2010, then the average value of Xj of the other 29 stocks in Jan 2010 is used. After data processing, we obtain the file “data.csv”, which is ready for use.
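The imputation rule described above can be sketched in pandas. This is a minimal illustration on made-up numbers, not the code used to produce “data.csv”; the tickers and the covariate name `X1` are hypothetical. Because the mean skips missing entries, a missing value is replaced by the average over the other stocks with an observed value in that month.

```python
import numpy as np
import pandas as pd

# Toy cross-section: 3 hypothetical stocks in one month, one covariate X1.
raw = pd.DataFrame({
    "Ticker": ["AAA", "BBB", "CCC"],
    "Month": ["2010-01"] * 3,
    "X1": [1.0, np.nan, 3.0],
})

covariates = ["X1"]

# Per-month mean of each covariate, broadcast back to every row.
# NaNs are ignored, so the mean for BBB's row uses AAA and CCC only.
month_means = raw.groupby("Month")[covariates].transform("mean")

# Fill only the missing cells; observed values are left untouched.
filled = raw.copy()
filled[covariates] = raw[covariates].fillna(month_means)
print(filled)
```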

2 Problems

(1) Consider the monthly log returns of each stock. Check whether they are normally distributed at the 5% significance level using the Shapiro-Wilk test. Point out the ticker symbols for the normally distributed ones. Remark: the returns in the data are net returns, so they must be converted to log returns first.
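As a rough sketch of the mechanics in problem (1): a net return R corresponds to the log return log(1 + R), and `scipy.stats.shapiro` returns the test statistic and p-value. The helper name `is_normal_shapiro` and the simulated series are assumptions for illustration only; in the project the test would be run on each stock's own return series.

```python
import numpy as np
from scipy import stats

def is_normal_shapiro(net_returns, alpha=0.05):
    """Return (p_value, fail_to_reject) for H0: the log returns are normal."""
    # Convert net returns R to log returns log(1 + R).
    log_returns = np.log1p(np.asarray(net_returns, dtype=float))
    statistic, p_value = stats.shapiro(log_returns)
    # Normality is not rejected when the p-value is at least alpha.
    return p_value, p_value >= alpha

rng = np.random.default_rng(0)
# Simulated net returns whose log returns are normal by construction.
simulated_net = np.expm1(rng.normal(loc=0.01, scale=0.05, size=120))
p_value, looks_normal = is_normal_shapiro(simulated_net)
print(f"p-value = {p_value:.3f}, fail to reject normality: {looks_normal}")
```

Failing to reject the null hypothesis is not proof of normality; it only means the sample shows no significant evidence against it at the chosen level.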

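(2) Split your data into two periods. Use the first 8.5 years of data for training and the last 1.5 years of data for out-of-sample testing. Consider OLS regression with all the covariates, LASSO, PLS, and boosted trees. If you need to tune the hyperparameter of a model, use 5-fold CV. Compare these models using the out-of-sample R-squared defined as

$$R^2_{\mathrm{OOS}} = 1 - \frac{\sum_{(i,t)} (r_{i,t+1} - \hat{r}_{i,t+1})^2}{\sum_{(i,t)} (r_{i,t+1} - \bar{r})^2},$$

where the sums run over all observations in the test period, $r_{i,t+1}$ and $\hat{r}_{i,t+1}$ are the observed and predicted returns, and $\bar{r}$ is the sample average of the observed returns in the test period.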
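A minimal sketch of the out-of-sample R-squared, assuming the usual form: one minus the ratio of the sum of squared prediction errors to the sum of squared deviations of the observed test returns from their test-period mean. The function name `oos_r_squared` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def oos_r_squared(y_test, y_pred):
    """1 - SSE / SST, with SST measured around the test-period mean."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_test - y_pred) ** 2)         # squared prediction errors
    sst = np.sum((y_test - y_test.mean()) ** 2)  # deviations from test mean
    return 1.0 - sse / sst

# Made-up test-period returns and three candidate forecasts.
y_test = np.array([0.02, -0.01, 0.03, 0.00])
perfect = oos_r_squared(y_test, y_test)
mean_forecast = oos_r_squared(y_test, np.full_like(y_test, y_test.mean()))
zero_forecast = oos_r_squared(y_test, np.zeros_like(y_test))
print(perfect, mean_forecast, zero_forecast)
```

Unlike the in-sample R-squared of an OLS fit, this quantity can be negative: a model whose forecasts are worse than the constant test-period mean has a negative value.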

(3) Which covariates are more important for predicting the return of the next month?
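One possible way to approach problem (3), sketched on synthetic data: after standardization, the magnitudes of LASSO coefficients are comparable across covariates, so they give a rough importance ranking. Everything here is an assumption for illustration: the covariate names `X0` to `X4`, the simulated data, and the choice of LASSO itself (tree-based feature importances are another option). Nothing below comes from the project data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_obs, n_cov = 300, 5
names = [f"X{j}" for j in range(n_cov)]  # hypothetical covariate names

# Synthetic covariates on very different scales; only X2 drives the response.
scales = np.array([1.0, 10.0, 100.0, 0.1, 5.0])
X = rng.normal(size=(n_obs, n_cov)) * scales
y = 0.5 * X[:, 2] / scales[2] + rng.normal(scale=0.1, size=n_obs)

# Standardize the covariates, then tune the LASSO penalty by 5-fold CV.
# Inside a pipeline the scaler is fitted on the training folds only.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X, y)

# Rank covariates by the absolute size of their standardized coefficients.
coefs = model.named_steps["lassocv"].coef_
ranking = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranking:
    print(f"{name}: {coef:+.4f}")
```

Without the standardization step, a covariate measured in large units would receive a small coefficient regardless of its predictive value, and the ranking would be misleading.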

(4) From your empirical results, do you think returns of the next month are predictable using the covariates given here? Discuss the implications of your results for the semi-strong efficient market hypothesis.

Remark: The 49 covariates are on different scales. You need to standardize them before using them in any model. We can also center each covariate so that the intercept of the linear regression model can be interpreted. Thus, for any covariate X, apply both centering and