PART 1 – Quantitative Credit Scoring 85%

Section I – Variable Mapping
Go to the following link to download the dataset:
PS: The first column is just index values; ignore it and do not count it as one of the features!

You have been provided a credit dataset with 20 different borrower attributes (7 numerical and 13 categorical). The attributes are described below.

Attribute description:

Attribute 1: (qualitative) Status of existing checking account
Attribute 2: (numerical) Duration in months
Attribute 3: (qualitative) Credit history
Attribute 4: (qualitative) Purpose
  o A40 : car (new)
  o A41 : car (used)
  o A42 : furniture/equipment
  o A43 : radio/television
  o A44 : domestic appliances
  o A45 : repairs
  o A46 : education
  o A47 : (vacation - does not exist?)
  o A48 : retraining
  o A49 : business
  o A410 : others
Attribute 5: (numerical) Credit amount
Attribute 6: (qualitative) Savings account/bonds
  o A61 : ... < 100 DM
  o A62 : 100 <= ... < 500 DM
  o A63 : 500 <= ... < 1000 DM
  o A64 : ... >= 1000 DM
  o A65 : unknown / no savings account
Attribute 7: (qualitative) Present employment since
  o A71 : unemployed
  o A72 : ... < 1 year
  o A73 : 1 <= ... < 4 years
  o A74 : 4 <= ... < 7 years
  o A75 : ... >= 7 years
Attribute 8: (numerical) Installment rate in percentage of disposable income
Attribute 9: (qualitative) Personal status and sex
  o A91 : male : divorced/separated
  o A92 : female : divorced/separated/married
  o A93 : male : single
  o A94 : male : married/widowed
  o A95 : female : single
Attribute 10: (qualitative) Other debtors / guarantors
  o A101 : none
  o A102 : co-applicant
  o A103 : guarantor
Attribute 11: (numerical) Present residence since
Attribute 12: (qualitative) Property
  o A121 : real estate
  o A122 : if not A121 : building society savings agreement / life insurance
  o A123 : if not A121/A122 : car or other, not in attribute 6
  o A124 : unknown / no property
Attribute 13: (numerical) Age in years
Attribute 14: (qualitative) Other installment plans
  o A141 : bank
  o A142 : stores
  o A143 : none
Attribute 15: (qualitative) Housing
  o A151 : rent
  o A152 : own
  o A153 : for free
Attribute 16: (numerical) Number of existing credits at this bank
Attribute 17: (qualitative) Job
  o A171 : unemployed / unskilled - non-resident
  o A172 : unskilled - resident
  o A173 : skilled employee / official
  o A174 : management / self-employed / highly qualified employee / officer
Attribute 18: (numerical) Number of people being liable to provide maintenance for
Attribute 19: (qualitative) Telephone
  o A191 : none
  o A192 : yes, registered under the customer's name
Attribute 20: (qualitative) Foreign worker
  o A201 : yes
  o A202 : no

The 21st column denotes the credit status of the borrower, with 1 being good and 2 being bad.

Your first task is to transform the provided dataset into the respective categorical variables based on the attributes described above.

Section II – Exploratory Data Analysis and Wrangling

Your next task is to conduct exploratory data analysis.

1) Observe the variables by plotting histograms and box plots (for continuous variables) and frequency tables or bar plots (for categorical variables). Conduct proper outlier detection and, if any outliers are found, use the tools taught in class to treat them. Present your analysis.
2) You may notice that the categorical variables contain missing values in some instances. It is advisable to impute the missing data instead of dropping the rows entirely, which would result in loss of information. Keep in mind that the variables are not all of the same type (some continuous, some categorical), so use your judgement to choose an appropriate imputation method for each type. Present your analysis.
3) Conduct a cross tabulation of all the categorical predictors with the credit status of the borrower. For this you need to create a cross contingency table.
   For example, if we take a categorical variable with 3 categories (1, 2, and 3), the cross contingency table will take the following form:

   Credit Status | 1                        | 2                        | 3                        | Row Total
   0             | # of 1 in 0 (and % of 0) | # of 2 in 0 (and % of 0) | # of 3 in 0 (and % of 0) | Total 0
   1             | # of 1 in 1 (and % of 1) | # of 2 in 1 (and % of 1) | # of 3 in 1 (and % of 1) | Total 1
   Column Total  | Total 1                  | Total 2                  | Total 3                  | Sub-total

   For each table, present an outline of the analysis.
4) Using the cross contingency tables, perform a chi-square test of the dependence between each categorical variable and the credit status of the borrower. Take note of the variables that have a statistically significant dependence with the response variable.
5) For the continuous variables, present the necessary descriptive statistics. Make sure to standardize the continuous variables before moving to estimation. A correlation matrix of all the variables (categorical and continuous) will also give a good idea of the dependence structure of the dataset.

Section III – Estimation

Before moving to the estimation phase, note that the full dataset should not be used for estimation. Conduct a 70:30 cross validation (a hold-out split): randomly sample 70% of the data as the training set and keep the remaining 30% as the test set.

1) Start by estimating a logistic regression using all the significant categorical predictors (based on the chi-square tests) and all numeric variables. At every iteration, take out the insignificant variables and re-estimate until all the variables are significant at the 5% level. Briefly explain the final chosen variables and the signs of their coefficients.
   a. For the logistic regression, you must build your own function, which includes constructing the functions for the logistic distribution, the log likelihood and the optimization process. Use Appendix A3 of the textbook as a guide to building your own function for Newton's method, since you need the Hessian matrix to calculate the relevant regression statistics.
   b. Plot the ROC curve for the in-sample prediction.
      Use built-in packages/functions to extract the ROC curve. Perform a Kolmogorov-Smirnov (KS) test on the true positive rates and false positive rates for each cutoff value. Pick the appropriate cutoff value based on the KS test. Use this cutoff value for out-of-sample prediction. (This will be important in Section IV for calculating the Brier score and conducting the HL test.)
2) Estimate a stepwise logistic regression model. Present your output results. Briefly explain the signs of the coefficients and their significance.
   a. Refer to the slides for an idea of how to approach the algorithm for the stepwise functions. Feel free to implement only one method, either backward or forward.
   b. Perform the KS test as before to obtain the optimal cutoff value. Use this cutoff value for out-of-sample prediction.
3) Estimate a decision tree on the given dataset. Present your results.
   a. For this question, feel free to leverage libraries or built-in functions for estimation.
   b. Apply cost complexity pruning to the large tree obtained in order to obtain a sequence of sub-trees. Conduct a K-fold cross validation.
   c. Find the complexity parameter for which the cross-validated error rate is minimum.
   d. Prune the tree using this complexity parameter. This way, you will obtain the final tree.[1]
4) Estimate a random forest on the given dataset. Set the forest size to 1000 trees and the number of variables evaluated per node to 5. For more information on the methodology and how to estimate it in R, go over this document:
   a. For this question, feel free to leverage libraries or built-in functions for estimation.
   b. Present the necessary results.
   c. Plot the variables by importance. Do the results make sense?
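
The hand-built logistic regression in item 1 of Section III and the KS-based cutoff can be sketched roughly as follows. This is only an illustration in plain Python, not the textbook's Appendix A3 code: the function names (`fit_logit_newton`, `ks_cutoff`, and the helpers) and the toy data are hypothetical, `y = 1` is assumed to mark the event being modelled, and the KS statistic is taken as the largest gap between the true positive rate and the false positive rate across cutoffs.

```python
import math

def sigmoid(z):
    """Logistic CDF, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def score_and_information(X, y, beta):
    """Return the gradient X'(y - p), the negative Hessian X'WX, and p."""
    k = len(beta)
    p = [sigmoid(sum(b * x for b, x in zip(beta, row))) for row in X]
    grad = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
            for j in range(k)]
    info = [[sum(pi * (1 - pi) * row[a] * row[c] for row, pi in zip(X, p))
             for c in range(k)] for a in range(k)]
    return grad, info, p

def fit_logit_newton(X, y, tol=1e-10, max_iter=50):
    """Newton's method; each row of X must start with 1.0 for the intercept."""
    beta = [0.0] * len(X[0])
    for _iteration in range(max_iter):
        grad, info, _p = score_and_information(X, y, beta)
        step = solve(info, grad)  # (X'WX)^{-1} X'(y - p)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    # Standard errors: square roots of the diagonal of the inverse information.
    _grad, info, p = score_and_information(X, y, beta)
    k = len(beta)
    se = [math.sqrt(solve(info, [1.0 if i == j else 0.0 for i in range(k)])[j])
          for j in range(k)]
    return beta, se, p

def ks_cutoff(probs, y):
    """Cutoff maximising TPR - FPR; assumes both classes are present in y."""
    pos = sum(y)
    neg = len(y) - pos
    best_cut, best_ks = None, -1.0
    for cut in sorted(set(probs)):
        tpr = sum(1 for p, yi in zip(probs, y) if p >= cut and yi == 1) / pos
        fpr = sum(1 for p, yi in zip(probs, y) if p >= cut and yi == 0) / neg
        if tpr - fpr > best_ks:
            best_cut, best_ks = cut, tpr - fpr
    return best_cut, best_ks

# Hypothetical toy data: one predictor, six borrowers (not the course dataset).
X = [[1.0, float(v)] for v in range(6)]
y = [0, 0, 1, 0, 1, 1]
beta, se, p_hat = fit_logit_newton(X, y)
cutoff, ks = ks_cutoff(p_hat, y)
```

Dividing each coefficient by its standard error gives the z-statistic used to judge significance at the 5% level. Note that Newton's method will not converge if the classes are perfectly separable, so the toy data deliberately overlap.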

Section IV – Performance Validation

1) For each model estimated in Section III, predict the default probability using the test set.
2) Build a function for the Brier score[2] and calculate the score for each prediction result.
3) Build a function for the Hosmer-Lemeshow test[3] and use it to test each prediction result.
4) Plot the ROC curve for each method, and calculate the AUC for each out-of-sample prediction.
5) Which model would you recommend as your final model?

[1] For a brief on K-fold cross validation, please see:
[2] For a brief on the Brier score,
[3] A good primer,

PART 2 – Fundamental Credit Analysis 15%

You are an analyst for a bond hedge fund, a credit analyst, or working in a pension fund (private asset classes' investment division or investment risk). The goal of this part of the project is to familiarize you with some concepts and approaches for credit analysis, especially what is known as fundamental credit analysis, which some places call the internal rating approach.

- To begin, pick four 'comparable' companies that are within the same industry, have the same rating, and have bonds outstanding, preferably with approximately the same time to maturity.
- To make your life easier, all bonds should be denominated in USD/CAD and issued by firms domiciled in either Canada or the US.
- You can use the Bloomberg SRCH function to search for bonds and firms. See these links for details:

Further,

· Given the industry you choose, create internal ratings of the firms in your sample. I have shared several Moody's reports by industry that talk about the important factors for your credit analysis. You can find them here: You can follow Moody's methodology. You can also complement the analysis using the 5 steps of credit analysis that I mentioned in class to guide your thinking.
  Basically, I am looking for fundamental analysis.
  o The attached Moody's documents should help you understand which ratios are more important than others in each industry.
  o Augment your internal ratings of the firms by adding recovery analysis, so that the ratings are better informed and account for both the likelihood of default and the severity of default (if you think the security value at time of default is not represented in the rating methodology you are following).
  o You can also compute the Z-score and the Ohlson score. They are helpful for comparison across companies, even though you will notice that your internal ratings may dominate those measures. Comment on that.
  o Overall, consider factors that affect the potential default probability (cash flow, capacity to repay, and covenant strictness relative to the other firms you chose or to the industry).
  o Consider factors that affect the potential recovery rate (collateral, tangible assets, priority, other senior or junior debt in the capital structure, etc.).
· What can you learn from the above information about the differences in bond yields of these firms?
· Based on the internal ratings you assigned, how do you assess these companies relative to the ratings provided in Bloomberg? Explain the rationale for any deviation between your internal rating methodology and that of the rating agencies.
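
As a rough illustration of the score-based comparison mentioned among the optional measures in Part 2, the sketch below assumes that "Z" refers to Altman's Z-score in its original 1968 form for publicly traded manufacturing firms. The function names and the input figures are hypothetical and are not taken from any real company; other variants of the score exist for private or non-manufacturing firms and use different coefficients.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original Altman Z-score (assumed variant: public manufacturers)."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5

def z_zone(z):
    """Conventional zones for the original score: >2.99 safe, <1.81 distress."""
    if z > 2.99:
        return "safe"
    if z < 1.81:
        return "distress"
    return "grey"

# Hypothetical balance-sheet figures, all in the same currency unit.
z = altman_z(working_capital=20, retained_earnings=30, ebit=10,
             market_value_equity=80, sales=120,
             total_assets=100, total_liabilities=50)
zone = z_zone(z)
```

Computing the same score for all four firms in the sample gives a quick cross-check against the internal ratings, which is the comparison the assignment asks you to comment on.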