
BIG DATA ANALYTICS
Analysis of a German Bank and Its Clientele
M2C-UB3, Montpellier Business School, December 2015
Abstract
In this report we use big data analytics techniques to understand the bank and
its clientele from different perspectives (Part 2), to build regression models
(Part 3) and decision trees (Part 4) that could facilitate the determination of
either the optimal credit rate or the future client risk, and to identify the
nearest neighbors to the suboptimal client profile in order to put them into a
refinancing program that would help them get out of debt (Part 5).
The Dataset

For this study, a slightly modified version of the “German.xls” dataset (available at the Iain Pardoe
website, from the University of Oregon) was used. The modified file was renamed
“German3.xlsx” and is presented as an extra delivery file. The changes to the original file
were made so that more of the big data techniques seen in class during the different
sessions could be replicated. The “German3.xlsx” file contains the following worksheets:
1- Codelist: Explanation of the attributes in the dataset
2- Data: Dataset itself with 32 attributes and 1000 observations (~clients)
3- Job: Client job types
4- Purpose: Credit purpose types
5- Mar_Stat: Client marital status
6- Hist: Client credit history
7- Origin: Client origin (national versus foreigner)
8- cor(data3): Correlation matrix exported to Excel for easier visualization
The “German3.xlsx” file contains 1000 observations (interpreted as 1000 different clients for our
purposes) of credit lines for an undisclosed German bank. 32 attributes are associated with these
1000 different clients. Some of these attributes are client age, job type, marital status, checking
and savings account balances, credit history, purpose of the credit, credit rate, credit duration,
and credit amount. For the complete list of attributes please refer to the Codelist worksheet. It
might be useful to clarify that the dataset is rather old and still has the Deutsche Mark as the currency
for Germany (it was therefore created before 2002). This fact is acknowledged but is not critical for
our purposes, as the reader will understand while reading this report, running the code and
looking at the graphs.

The Analysis

The code written for this big data analysis (copied at the end of this report and presented as an
extra delivery file) was split into five different parts, which are clearly indicated in the code itself for
easier navigation, execution and interpretation. The five parts correspond to the four (4) big data
analytics techniques seen in class (general analytics, regressions, decision trees, and nearest
neighbors) plus an initial section for installation and activation of all the required R packages.

In Part 2, the general analytics techniques were used for two main purposes: understanding the
bank, and understanding the clients and the motivation behind their credits. Before doing that,
however, Part 2.2 shows some initial manipulation of the dataset that was required in order to
simplify the analysis. This manipulation resulted in (a) the reduction of the observations (clients)
from 1000 to 652 after the elimination of apparently erroneous information, and (b) the
generation of two extra attributes for a new total of 34. Part 2.3 presents the bank through some
basic histograms of the complete dataset. By doing this, it was possible to understand that:
(i) most of the loans given by the bank are short-duration loans of 5 to 25 months (see
Figure 1), even though there are loans of up to 60 months
(ii) most of the loans are for small amounts of 1,000 to 5,000 DM (see Figure 2), even
though the bank also gives credits of more than 15,000 DM
(iii) most of the clients are young people between 25 and 40 years old (see Figure 3), even
though clients borrowing money can also be 70+ years old
(iv) the bank gives loans to people with all types of credit histories, but most of them have a
credit history of 2, “clients with existing credits paid back duly until now” (see Figure
4 and Codelist sheet, attribute #4)
(v) the most frequent credit rate offered to clients is 4% (see Figure 5)
(vi) the most frequent purpose for credit is to buy a TV/radio, followed by buying a new car
(see Figure 6)
(vii) there is no clear relationship between credit rate and amount borrowed or
duration (see Figure 7), or between credit history and amount borrowed or rate (see
Figure 8), but there is, as expected, a correlation between amount and credit duration
(see Figure 9)
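
The histogram step behind findings (i) to (iii) can be sketched in R as follows. This is only an illustration on simulated data, not the report's own code: the data frame `loans`, its simulated values and the 5-month bin width are assumptions made for the example.

```r
# Minimal sketch on simulated data (NOT the German3.xlsx data).
set.seed(1)
loans <- data.frame(
  DURATION = sample(4:60, 500, replace = TRUE),        # months (hypothetical)
  AMOUNT   = round(rexp(500, rate = 1 / 3000)) + 250   # DM (hypothetical)
)

# hist() returns the bin breaks and counts; plot = FALSE skips the drawing
h_dur <- hist(loans$DURATION, breaks = seq(0, 60, by = 5), plot = FALSE)

# Locate the most frequent bin and report its lower and upper limits
top <- which.max(h_dur$counts)
modal_bin <- c(h_dur$breaks[top], h_dur$breaks[top + 1])
modal_bin

# Share of loans that last at most 25 months
share_short <- mean(loans$DURATION <= 25)
share_short
```

Reading the counts from the object returned by `hist()` makes a statement such as "most loans last 5 to 25 months" checkable as a number rather than only by eye.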

In Part 2.4, barplots were used to illustrate the bank’s social responsibility profile. Figure 10
shows how averaged credit rates have an inverse relationship with job type (which is a proxy for
income, see Codelist sheet, attribute #28), therefore the averaged credit rate decreases as client
income decreases, which is socially positive. Figure 11 shows that averaged credit rates are
smaller for foreigners than for nationals. Figure 12 shows that averaged credit rates decrease as
the number of dependents increases, which is also socially positive. Finally, Figure 13 shows that
the averaged credit rates are the same regardless of the client's credit history, showing no
discrimination based on history. In Part 2.5, ggplot-based techniques allowed for a better
understanding of clients’ motivations when getting a loan. It is crucial to understand the clients’
motivation in order to serve them better. Credit purpose distribution plots based on income
(Figure 14), marital status (Figure 15) and origin (Figure 16) provided the following potentially
interesting insights:
a. Most of the unemployed clients want to get a new car, while clients with higher income
have a more diversified distribution of purposes.
b. RadioTV is the main purpose (>50%) for loans among married clients, while RadioTV &
NewCar are the main purposes (~50%) for loans among single clients
c. NewCar is the main purpose (>70%) for loans among foreign clients
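
The two kinds of summaries used in Parts 2.4 and 2.5 (an averaged credit rate per group, and the distribution of credit purposes within a group) can be sketched as below. The data frame `clients` and all of its values are simulated assumptions, and base R functions are used here for brevity instead of the plotting code in the report.

```r
# Minimal sketch on simulated data (NOT the German3.xlsx data).
set.seed(2)
clients <- data.frame(
  JOB     = sample(0:3, 300, replace = TRUE),   # job type (hypothetical coding)
  RATE    = sample(1:4, 300, replace = TRUE),   # credit rate (hypothetical)
  PURPOSE = sample(c("NewCar", "RadioTV", "Furniture"), 300, replace = TRUE),
  FOREIGN = sample(c("national", "foreign"), 300, replace = TRUE)
)

# Average credit rate per job type: the quantity a bar plot would display
avg_rate <- tapply(clients$RATE, clients$JOB, mean)
barplot(avg_rate, xlab = "Job type", ylab = "Average credit rate")

# Distribution of credit purposes within each origin group:
# margin = 1 converts each row of counts into shares that sum to 1
purpose_share <- prop.table(table(clients$FOREIGN, clients$PURPOSE), margin = 1)
round(purpose_share, 2)
```

A statement such as "NewCar is the main purpose (>70%) among foreign clients" corresponds to reading one row of a table like `purpose_share`.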

Part 3 covers regressions. In Part 3.1, the author started by generating the overall correlation
matrix, in which covariance-based correlation factors were calculated for every pair of the
available attributes. Figure 17 shows a useful way to highlight correlated attributes, in which a
perfect correlation is shown as white cells along the matrix diagonal. As an example, notice the
high correlation between attributes #17 and #34, which simply indicates that most of the clients
are married. It is acknowledged that, given the way this dataset was originally set up, most of the
attributes do not show a good correlation (most colors in Figure 17 are hot). In Part 3.2, the
author tried to find a good regression model to estimate the credit rate that would be offered to
clients, first starting with all attributes, and later with only the attributes that showed a good
correlation with the forecasted variable. As the errors when forecasting credit rates were still too
high, extra efforts were made to create two additional attributes named Inverse (#35) and Direct
(#36). The idea was to generate more robust credit risk attributes that would result in higher
forecasting accuracy (see the more complete explanation in the code itself). Figure 18 shows the
result of the best forecasting model, which is clearly not a successful model. Part 3.3 was an
attempt to forecast the response from the bank (see Codelist sheet, attribute #32). Unfortunately,
no attribute was found to correlate with that response (adjusted R-squared = 0.21), which
suggests that the logic behind whether or not the bank approves a credit line is rather
complicated. Part 3.4, on the other hand, was a very successful forecasting exercise.
Here, credit history was taken as a proxy for future credit risk. The premise is that a client
who has always paid his credits fully (and on time) has a high probability of being a low-risk client in
the future. Four different regressions were run for this purpose and the result of the
best model (with an adjusted R-squared of 0.97) is shown in Figure 19. Notice the close agreement
between the actual and the estimated credit risk. Notice as well that the results of every
regression tested are reported in the code itself as comments. This is done in case the reader is
not able to run the code.
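
The workflow of Part 3 (a correlation matrix, a regression on all attributes, then a regression on the best-correlated attribute, compared by adjusted R-squared) can be sketched as follows. The data are simulated and the linear link between `HISTORY` and `DIRECT` is an assumption built into the example, so the numbers it produces say nothing about the bank's data.

```r
# Minimal sketch on simulated data (NOT the German3.xlsx data).
set.seed(3)
n <- 200
DIRECT  <- sample(0:6, n, replace = TRUE)
INVERSE <- sample(0:6, n, replace = TRUE)
AGE     <- sample(20:70, n, replace = TRUE)
# Hypothetical response: depends linearly on DIRECT plus noise
HISTORY <- 0.5 * DIRECT + rnorm(n, sd = 0.3)
d <- data.frame(HISTORY, DIRECT, INVERSE, AGE)

# Pearson correlation for every pair of attributes (diagonal is exactly 1)
cor_m <- cor(d)
round(cor_m, 2)

# Regression on all attributes versus on the single correlated attribute
fit_all <- lm(HISTORY ~ ., data = d)
fit_one <- lm(HISTORY ~ DIRECT, data = d)

# Adjusted R-squared penalizes attributes that add no explanatory power
adj_all <- summary(fit_all)$adj.r.squared
adj_one <- summary(fit_one)$adj.r.squared
c(all = adj_all, one = adj_one)
```

Comparing the two adjusted R-squared values shows whether the extra attributes earn their place in the model, which is the comparison the regression tables in the code comments support.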

In Part 4, decision trees were created in order to estimate both the credit rate offered to a
particular client and the future credit risk (or risk profile, again using credit history as a proxy).
This section was split into two. Part 4.1 presents three different decision trees to determine the credit
rate (see Figures 20, 21 and 22). Each tree is an improvement over the previous one, trying to find the
right balance between accuracy and ease of visualization (very complex trees, although normally very
accurate, are hard to interpret). Notice that the interpretation of every tree is reported in the
code itself as comments, indicating both how to navigate it (right or left) and the difference
between the estimated and the actual credit rate. Part 4.2 presents two trees to easily determine
client risk. Figures 23 and 24 show these trees and, as before, the comments in the code indicate
how to navigate them and show how useful they are.
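
The trade-off between accuracy and readability described above is controlled in `rpart` by the complexity parameter `cp`. The sketch below fits two regression trees on simulated data with the two `cp` values that appear in the code headings (0.08 and 0.01). The data, the rule used to simulate `RATE` and the helper `n_leaves` are assumptions made for illustration only.

```r
# Minimal sketch on simulated data (NOT the German3.xlsx data).
library(rpart)
set.seed(4)
n <- 400
AMOUNT   <- round(runif(n, 250, 15000))
DURATION <- sample(4:60, n, replace = TRUE)
# Hypothetical rule for the rate, plus noise
RATE <- ifelse(AMOUNT < 3000, 4, ifelse(DURATION > 24, 3, 2)) + rnorm(n, sd = 0.2)
d <- data.frame(RATE, AMOUNT, DURATION)

# A larger cp prunes more aggressively and gives a smaller, easier tree
fit_coarse <- rpart(RATE ~ AMOUNT + DURATION, data = d, method = "anova",
                    control = rpart.control(cp = 0.08))
fit_fine   <- rpart(RATE ~ AMOUNT + DURATION, data = d, method = "anova",
                    control = rpart.control(cp = 0.01))

# Hypothetical helper: count terminal nodes as a measure of tree complexity
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(coarse = n_leaves(fit_coarse), fine = n_leaves(fit_fine))

# Estimated rate for one client (the AMOUNT and DURATION values quoted
# for index 652 in the code comments)
pred <- predict(fit_coarse, newdata = data.frame(AMOUNT = 4576, DURATION = 45))
pred
```

Calling `plot(fit_coarse); text(fit_coarse)` draws the tree so that it can be navigated left or right by hand, as the comments in the report's code do.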

Finally, in Part 5, the author used the nearest neighbors technique to identify high-risk
clients, or suboptimal client profiles, with the idea of putting them into a refinancing program to
help them get out of debt. In order to do this, it was necessary to define and locate the client
with the highest risk. First, all clients with accounts currently classified as “critical” (see Codelist
sheet, attribute #4) were selected. Then, among these critical clients, the worst profile was
identified by finding the client with the highest summation of risk attributes (e.g. no money in
checking or savings accounts, unemployed, owns no residence, does not have a co-applicant or a
guarantor that could minimize the risk, etc.). By doing this, the client with ID #331 was identified
as the client with the highest risk. The 10 nearest neighbors to the client with ID #331 were then
identified, and the result of that analysis could be presented to bank managers so that the credit
rate or the credit duration of these clients can be revisited. Figure 25 shows the histogram of the
calculated distances to the client with the highest risk. The code (also copied at the end of this report)
contains, as comments, the tables that should be presented to management.
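
The nearest neighbors step can be sketched compactly with `dist()` and `order()`. The data frame `crit` is simulated, only three of the risk attributes are used, and the reference client is assumed to be the one with the smallest attribute sum (following the code's convention that smaller attribute values mean higher risk). The report's own code uses an explicit loop instead.

```r
# Minimal sketch on simulated data (NOT the German3.xlsx data).
set.seed(5)
crit <- data.frame(
  ID         = 101:130,                          # hypothetical client IDs
  CHK_ACT    = sample(0:3, 30, replace = TRUE),
  SAV_ACT    = sample(0:3, 30, replace = TRUE),
  EMPLOYMENT = sample(0:4, 30, replace = TRUE)
)
attrs <- c("CHK_ACT", "SAV_ACT", "EMPLOYMENT")

# Assumption: smaller values mean higher risk, so the reference "bad client"
# is the first client with the smallest sum over the selected attributes
score <- rowSums(crit[, attrs])
ref <- which.min(score)

# Euclidean distance between every pair of clients, as a full matrix
d <- as.matrix(dist(crit[, attrs]))
dist_ref <- d[ref, ]
dist_ref[ref] <- Inf            # leave the reference client out of the search

# Indices of the k closest clients, nearest first
k <- 10
nn_idx <- order(dist_ref)[1:k]
crit$ID[nn_idx]
```

Setting the reference's own distance to `Inf` plays the same role as overwriting it with a large constant: the client can never be returned as its own neighbor.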
The Code
# Project for Big Data Analytics Course @ Montpellier Business School. M2C-UB3.
# Code written by David Ramírez during December 2015
# Total Code length = 67 lines, Effective Code Length (Total - Comments) = 232 lines
#
# --------------------------------------------------------------------------------------------------------------------------
############################# PART 1: INITIATION ###############################
# --------------------------------------------------------------------------------------------------------------------------
#
#install.packages("openxlsx") # Installing required packages
#install.packages("ggplot2")
#install.packages("rpart")
library(openxlsx) # Activating required packages
library(ggplot2)
library(rpart)
#
# --------------------------------------------------------------------------------------------------------------------------
######################### PART 2: GENERAL ANALYTICS ############################
# --------------------------------------------------------------------------------------------------------------------------
# 2.1 Reading Input Data
# --------------------------------------------------------------------------------------------------------------------------
#
data1 <- [...]
# LEARNING: RadioTV is the main purpose (>50%) for loans among married clients
# LEARNING: RadioTV & NewCar are the main purposes (~50%) for loans among single clients
# LEARNING: Furniture, NewCar, Retraining are the main purposes (~75%) for loans among
# divorced clients
#
# 2.5.3 Credit Purposes Distribution based on ORIGIN [31]
#
dist <- [...]
# LEARNING: NewCar is the main purpose (>70%) for loans among foreign clients
# LEARNING: Loan purposes for nationals are more distributed, but Radio dominates
#

# ----------------------------------------------------------------------------------------------------------------------------
############################## PART 3: REGRESSIONS #############################
# --------------------------------------------------------------------------------------------------------------------------
# 3.1 Overall Attributes Correlation
# --------------------------------------------------------------------------------------------------------------------------
#
# Analyzing correlation among all attributes
cor_m |t|)
# (Intercept) 2.436e+00 3.489e-01 6.984 7.30e-12 OK
# CHK_ACT -2.006e-02 3.197e-02 -0.628 0.53050 (UNCORRELATED)
# DURATION 3.205e-02 3.984e-03 8.044 4.31e-15 OK
# HISTORY 2.14e-02 4.041e-02 0.523 0.60105 (UNCORRELATED)
# AMOUNT -2.121e-04 1.754e-05 -12.08 |t|)
# (Intercept) 2.938e+00 7.720e-02 38.053 |t|)
# (Intercept) 2.605e+00 1.86e-01 13.814 |t|)
# (Intercept) 5.213e-01 1.683e-01 3.098 0.02037 (UNCORRELATED)
# CHK_ACT 2.301e-01 8.119e-02 2.834 0.04740 (UNCORRELATED)
# DURATION -6.421e-03 1.673e-03 -3.838 0.00137 (UNCORRELATED)
# HISTORY 2.753e-01 8.714e-02 3.160 0.01656 (UNCORRELATED)
# AMOUNT -7.554e-07 7.332e-06 -0.103 0.917971 (UNCORRELATED)
# SAV_ACT 1.724e-01 8.060e-02 2.139 0.032815 (UNCORRELATED)
# EMPLOYMENT 1.50e-01 8.180e-02 1.895 0.058595 (UNCORRELATED)
# CO_APLICANT 8.853e-02 1.199e-01 0.738 0.460628 (UNCORRELATED)
# GUARANTOR 2.306e-01 1.063e-01 2.168 0.030497 (UNCORRELATED)
# TIME_RES 1.426e-02 1.644e-02 0.867 0.386137 (UNCORRELATED)
# REAL_ESTATE 2.172e-01 8.861e-02 2.451 0.014518 (UNCORRELATED)
# PROP_NONE 5.959e-02 6.821e-02 0.874 0.382618 (UNCORRELATED)
# AGE -1.391e-03 1.637e-03 -0.850 0.395681 (UNCORRELATED)
# MORE_INSTAL 1.307e-01 9.525e-02 1.372 0.170476 (UNCORRELATED)
# NUM_CRED 1.917e-01 9.028e-02 2.123 0.034121 (UNCORRELATED)
# JOB 1.64e-01 8.459e-02 1.968 0.04959 (UNCORRELATED)
# NUM_DEP 2.351e-01 9.345e-02 2.516 0.012125 (UNCORRELATED)
# FOREIGN 1.408e-01 7.952e-02 1.71 0.0708 (UNCORRELATED)
# PURPOSE 5.64e-03 1.009e-02 0.561 0.574910 (UNCORRELATED)
# MAR_STAT 3.395e-02 3.651e-02 0.930 0.352847 (UNCORRELATED)
# INVERSE -1.351e-01 7.953e-02 -1.69 0.089864 (UNCORRELATED)
# DIRECT -2.236e-01 8.566e-02 -2.61 0.09253 (UNCORRELATED)
# ---
# Multiple R-squared: 0.239, Adjusted R-squared: 0.2136
# F-statistic: 9.423 on 21 and 630 DF, p-value: |t|)
# (Intercept) -7.615e-01 7.273e-02 -10.470 |t|)
# (Intercept) -0.740894 0.048156 -15.38 |t|)
# (Intercept) -0.881975 0.119906 -7.356 5.76e-13 OK
# INVERSE 0.04969 0.09124 5.47 7.28e-08 (UNCORRELATED)
# DIRECT 0.53508 0.016480 32.470 |t|)
# (Intercept) 1.1947649 0.2175912 5.491 5.7e-08 (UNCORRELATED)
# CHK_ACT 0.09947 0.0310436 3.219 0.0135 (UNCORRELATED)
# SAV_ACT 0.0301495 0.024013 1.256 0.20971 (UNCORRELATED)
# EMPLOYMENT 0.051230 0.0318907 1.606 0.10872 (UNCORRELATED)
# CO_APLICANT -0.0052416 0.1979202 -0.026 0.9788 (UNCORRELATED)
# GUARANTOR -0.1243610 0.1683493 -0.739 0.46036 (UNCORRELATED)
# REAL_ESTATE 0.1785326 0.0892164 2.01 0.04580 (UNCORRELATED)
# PROP_NONE -0.0000767 0.1042335 -0.01 0.9941 (UNCORRELATED)
# MORE_INSTAL -0.4052822 0.0941640 -4.304 1.94e-05 (UNCORRELATED)
# NUM_CRED 0.836025 0.063892 13.086 right/left -> Rate= 2.4 vs. TrueRate= 4
c(data3[652,1],data3[652,3],data3[652,14]) # Printing c(AMOUNT,DURATION,RATE) for index=652
# c(4576 45 3) -> left/right -> Rate= 2.87 vs. TrueRate= 3
#
# LEARNING: Model needs higher precision. Big errors estimating rate for Ex1.
#
# 4.1.2 Tree #2 for Credit Rate (two attributes) cp=0.08
#
fitT_rate1 right/left/right -> Rate= 3.4 vs. TrueRate= 4
c(data3[652,1],data3[652,3],data3[652,14]) # Printing c(AMOUNT,DURATION,RATE) for index=652
# c(4576 45 3) -> left/right/right -> Rate= 3.3 vs. TrueRate= 3
#
# LEARNING: With higher precision, Ex1 improved, Ex652 is slightly worse than before
# LEARNING: Best to use other attributes to build a more accurate decision tree
#
# 4.1.3 Tree #3 for Credit Rate (more attributes) cp=0.08
#
fitT_rate2 right/right -> Rate= 3.64 vs. TrueRate= 4
c(data3[652,1],data3[652,28],data3[652,13],data3[652,21],data3[652,14]) # printing for index=652
# c(4576 2 0 0 3) -> left/right/left -> Rate= 2.7 vs. TrueRate= 3
#
# LEARNING: Overall, fitT_rate2 is a better decision tree.
# LEARNING: Higher precision will result in decision trees that are hard to visualize
#
# # --------------------------------------------------------------------------------------------------------------------------
# 4.2 Decision Trees for determining CREDIT HISTORY (Proxy for Future Risk)
# # --------------------------------------------------------------------------------------------------------------------------
#
# 4.2.1 Tree #1 for Credit History (two attributes: INVERSE & DIRECT) cp=0.01
#
fitT_c_his1 right/right -> 'Risk'= 3.7 vs. 'TrueRisk'= 4
c(data3[652,36],data3[652,35],data3[652,4]) # Printing c(DIRECT,INVERSE,HISTORY) for index=652
# c(6 5 4) -> right/left/left -> 'Risk'= 2.74 vs. 'TrueRisk'= 4
#
# LEARNING: This very simple model did ok, Ex652 was not accurate
#
# 4.2.2 Tree #2 for Credit History (More attributes) cp=0.08
#
fitT_c_his2 right/right/left/left/right -> 'Risk'= 2.9 vs. True'Risk'= 4
c(data3[652,36],data3[652,27],data3[652,29],data3[652,24],data3[652,4]) # printing for index=652
# c(6 1 1 0 4) -> right/left/right/right/right -> 'Risk'= 3.6 vs. True'Risk'= 4
#
# LEARNING: fitT_c_his2 presented an overall marginal improvement over fitT_c_his1
# LEARNING: Higher precision will result in decision trees that are hard to visualize
#
# --------------------------------------------------------------------------------------------------------------------------
########################## PART 5: NEAREST NEIGHBORS #########################
# --------------------------------------------------------------------------------------------------------------------------
#
# Selected attributes for the Nearest Neighbors calculation are below. It is based on
# the logic that the smaller the attribute values the higher the risk profile of the client.
# We want to identify high risk clients to put them in a special program to help them pay.
#
#[2]"CHK_ACT"
#[12]"SAV_ACT"
#[13]"EMPLOYMENT"
#[18]"CO_APLICANT"
#[19]"GUARANTOR"
#[21]"REAL_ESTATE"
#[26]"OWN_RES"
#
crit_acc 0] # the check
hist(temp_d3,15) # generating histogram
temp_r 1){
badclient <- badclient[1]
} else {
badclient <- badclient # ID=331 is the reference badclient profile
} # getting ONE badclient as reference
k3 <- 10 # Nb of neighbors to be identified
IDs3 <- vector() # Storage
dist3 <- d3[badclient,] # Distances to badclient
dist3[badclient] <- 9 # Replacing the zero from badclient by 9 (to leave it out)
for(i in 1:k3) { # Loop on k3
i_m3 <- match(min(dist3),dist3) # Looking for row numbers that are equal to the min(dist3)
IDs3[i] <- crit_acc[i_m3[1],1] # Getting the name of the Client of first min found
dist3[i_m3[1]] <- 9 # Replace min that was already found
} # Note: Min can change, we are replacing & calculating every time
IDs3 # Write out ID of clients close to the reference 'badclient'
#
# IDs3 = 126 36 398 457 495 517 536 579 587 618
#
ind_ii <- vector() # creating vector to store data
for (i in seq_along(IDs3)) { # looping over length of IDs3
ind_i <- which(crit_acc[,1] == IDs3[i]) # finding indices associated to IDs3
ind_ii <- c(ind_ii,ind_i) # concatenating indices
}
#
crit_acc[badclient,c(2,12,13,18,19,21,26)] # Printing attributes for Badclient (reference)
crit_acc[ind_ii,c(2,12,13,18,19,21,26)] # Printing attributes for nearest neighbors (to compare)
#
#           CHK_ACT SAV_ACT EMPLOYMENT CO_APLICANT GUARANTOR REAL_ESTATE OWN_RES
# 331 (ref)       0       0          0           0         0           0       0
# -------------------------------------------------------------------------------------
# 126             0       0          2           0         0           0       1
# 36              0       0          2           0         0           1       0
# 398             0       0          2           0         0           0       1
# 457             0       0          2           0         0           1       0
# 495             0       0          2           0         0           1       0
# 517             0       0          1           0         0           1       1
# 536             2       0          1           0         0           0       0
# 579             1       0          1           0         0           0       1
# 587             0       0          2           0         0           1       0
# 618             0       0          2           0         0           1       0
#
# LEARNING: Most of the NN have no money in CHK or SAV account, have no co-applicant or
# guarantor, own no real estate or residence and have been employed for less than 4 years
# LEARNING: The NN method did allow for identification of high risk clients
#
crit_acc[ind_ii,c(4,1,3,14)] # Printing the table c(ID,HISTORY,AMOUNT,DURATION,RATE)
# The idea is to present this table to management for refinancing decisions as part of
# the special program to help them pay given their high risk profile.
#
#  ID HISTORY AMOUNT DURATION RATE
# 126       4   2121       12    4
#  36       4    384        6    1
# 398       4   2348       36    3
# 457       4   3905       11    2
# 495       4    212       12    3
# 517       4   1361        6    2
# 536       4   2319       21    2
# 579       4   2820       36    4
# 587       4    279        9    2
# 618       4   3676        6    1
#
# --------------------------------------------------------------------------------------------------------------------------
################################# END OF CODE ################################
# --------------------------------------------------------------------------------------------------------------------------