Credit Score Modeling
Introduction to Credit Scoring
Credit scoring is best summed up by this quote from David Edelman who at the time was the Credit Director of the Royal Bank of Scotland:
Credit scoring is “one of the oldest applications of data mining, because it is one of the earliest uses of data to predict consumer behavior.”
A credit scoring model is a statistical model that assigns a risk value to prospective or existing credit accounts. Typically, we think of credit scoring in the context of loans: we are trying to determine the likelihood that an individual will default on a loan.
Scorecards are a common way of displaying the patterns found in a binary response model. Typically, the models that underlie scorecards are logistic regression models. The main benefit of scorecards is their clear and intuitive way of presenting the regression coefficients from a model. Scorecards are typically thought of in a credit modeling framework, but they are not limited to it; they are used in fraud detection, healthcare, and marketing as well.
Credit Scorecards
Credit scorecards, much like your FICO score, are statistical risk models put into a special format designed for ease of interpretation. They are used to make strategic decisions such as accepting or rejecting applicants and deciding when to raise a credit line, among other decisions. The credit scorecard format is very popular and successful in the consumer credit world for three primary reasons:
- People at all levels within an organization generally find it easy to understand and use.
- Regulatory agencies are accustomed to credit risk models presented in this fashion.
- Credit scorecards are straightforward to implement and monitor over time.
Let’s examine a simple example of a scorecard to see these benefits. Below is a simple scorecard built from a three-variable logistic regression model trying to predict default on a loan. The three variables are miss, which represents the number of months since the last missed payment for the applicant; home, which represents whether an applicant owns or rents their home; and income, which is the income bracket for the applicant.
Imagine we had an applicant whose last missed payment was 32 months ago, who owned their home, and who had a salary of $30,000. They would have a score of 525 (120 + 225 + 180). Let’s assume our cut-off for giving a loan was a score of 500. In that case, the applicant would be given the loan. Now imagine we had another applicant who last missed a payment 22 months ago, owned their home, but only had an income of $8,000. They would have a score of 445 (100 + 225 + 120). They would not be given a loan.
This is extremely easy for anyone to use and implement in any computing system or database. The person making the loan decision has easy cut-offs and variable groupings with which to bucket an applicant. They can also much more easily let an applicant know why they were rejected for a loan. For our second applicant, their income level fell in the lowest point bin for that variable. The same is true for their months since last missed payment. These are the reasons they were rejected for a loan.
This ease of interpretation protects the consumer as it is their right to ask why they were rejected for a loan. This is why regulators appreciate the format of scorecards so much in the credit world.
Discrete vs. Continuous Time
Credit scoring tries to understand the probability of default for a customer (or business). However, default depends on time for its definition. When a customer or business will default is just as valuable as whether they will. How we incorporate time into the evaluation of credit scoring is important for this reason.
Accounting for time is typically broken down into two approaches:
- Discrete time
- Continuous time
Discrete time evaluates binary decisions on predetermined intervals of time. For example, are you going to default in the next 30, 60, or 90 days? Each of these intervals has a separate binary credit scoring model. This approach is very popular when credit scoring consumers because people don’t care about the exact day of default so much as the number of missed payments. Knowing that a default happened at exactly 72 days isn’t needed as long as we know the consumer defaulted between 60 and 90 days. Used together, these models can piece together windows of time in which it is believed a consumer will default.
Continuous time evaluates the probability of default as it changes over continuous points in time. Instead of a series of binary classification models, survival analysis models are used for this approach, as they can predict the exact day of default. This is more important when credit scoring businesses, to determine the exact time a business may declare bankruptcy and default on a loan. The approach is starting to gain popularity in consumer credit modeling as well, to better determine the amount of capital to keep on hand when consumers default at specific times rather than within windows of time.
Data Description
The first thing that we need to do is load up all of the needed libraries in R that we will be using in these course notes. This isn’t needed for the SAS sections of the code.
install.packages("gmodels")
install.packages("vcd")
install.packages("smbinning")
install.packages("dplyr")
install.packages("stringr")
install.packages("shades")
install.packages("latticeExtra")
install.packages("plotly")
library(gmodels)
library(vcd)
library(smbinning)
library(dplyr)
library(stringr)
library(shades)
library(latticeExtra)
library(plotly)
We will be using the auto loan data to build a credit scorecard for applicants for an auto loan. This credit scorecard predicts the likelihood of default for these applicants. There are actually two data sets we will use. The first is a data set on 5,837 people who were ultimately given auto loans. The variables in the accepts data set are the following:
Variable | Description |
---|---|
Age_oldest_tr | Age of oldest trade |
App_id | Application ID |
Bad | Good/Bad loan |
Bankruptcy | Bankruptcy (1) or not (0) |
Bureau_score | Bureau score |
Down_pyt | Amount of down payment on vehicle |
Loan_amt | Amount of loan |
Loan_term | How many months vehicle was financed |
Ltv | Loan to value |
MSRP | Manufacturer suggested retail price |
Purch_price | Purchase price of vehicle |
Purpose | Lease or own |
Rev_util | Revolving utilization (balance/credit limit) |
Tot_derog | Total number of derogatory trades (go past due) |
Tot_income | Applicant’s income |
Tot_open_tr | Number of open trades |
Tot_rev_debt | Total revolving debt |
Tot_rev_line | Total revolving line |
Tot_rev_tr | Total revolving trades |
Tot_tr | Total number of trades |
Used_ind | Used car indicator |
Weight | Weight variable |
The accepts data set has actually been oversampled for us because the event of defaulting on the auto loans is only 5% in the population. Our sample has closer to a 20% default rate.
The second data set is 4,233 applicants who were ultimately rejected for a loan. We have the same information on these applicants except for the target variable of whether they defaulted since they were never given a loan. These individuals are still key to our analysis. Sampling bias occurs when we only use people who were given loans to make decisions on individuals who apply for loans. In order to correct for this bias, we perform reject inference on the individuals who applied, but were not given loans.
We will deal with this data set near the end of our credit scoring model process.
Data Collection and Cleaning
Defining the Target
When dealing with credit scoring data, the first major hurdle is to define the target variable of default. This might be harder than initially expected. When does someone actually default? Do you wait for the loan to be charged-off by the bank? There were probably plenty of signs before then that the customer would stop paying on their loan.
People used to always use 90 days past due (DPD) as the typical definition of default. If a customer went 90 days past their due date for a payment on a loan, they would be considered a default. Now, default ranges between 90 and 180 days past due based on the type of loan, the business sector, and country regulations. For example, in the United States, 180 days past due is the default standard on mortgage loans.
Predictor Variables
Selecting predictor variables in credit scoring models also takes care. Credit scoring models want variables that are highly predictive of default, but that isn’t the only criterion. Since regulators will be checking these models, the variables have to be easily interpretable from a business standpoint. They also must be reliably and easily collected both now and in the future, since you don’t want to make a mistake in a loan decision. These variables must also be thought of ethically to ensure that the bank is being fair and equitable to all of its customers.
Feature engineering is an important part of the process for developing good predictor variables. In fact, good feature engineering can replace the need for more advanced modeling techniques that lack the interpretation needed for a good credit scorecard model. Features may be created based on business reasoning, such as the loan to value ratio, the expense to income ratio, or the credit line utilization across time. Variable clustering may also be used to omit variables that are highly dependent on each other.
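As a brief illustration (these derived columns are hypothetical and use the lowercase column names from the R code later in these notes; an ltv variable already exists in the data), business-reasoned features can be built directly from the raw fields:

accepts$ltv_calc <- accepts$loan_amt / accepts$msrp * 100        # loan-to-value as a percentage
accepts$down_pct <- accepts$down_pyt / accepts$purch_price       # share of purchase price paid up front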
Sampling
When it comes to sample size, there are no hard-and-fast rules on how much data is needed for building credit scoring models. The FDIC suggests that samples “normally include at least 1,000 good, 1,000 bad, and 750 rejected applicants.” However, sample size really depends on the overall size of the portfolio, the number of predictor variables planned for the model, and the number of defaults in the data.
Sampling must also be characteristic of the population to which the scorecard will be applied. For example, if the scorecard is to be applied in the subprime lending program, then we must use a sample that captures the characteristics of the subprime population targeted. Here are the two main steps for sampling for credit scoring models:
- Gather data for accounts opened during a specific time frame.
- Monitor the performance of these accounts for another specific length of time to determine if they were good or bad.
This approach raises natural concerns. Accounts that are opened more recently are more similar to accounts that will be opened in the near future, so we don’t want to go too far back in time to sample. However, we want to minimize the chances of misclassifying the performance of an account, so we need to monitor accounts long enough to let them fail. Banks develop cohort graphs to help them determine how long a typical customer takes to default on a loan. Essentially, they watch customer accounts and their default rates; when these default rates level off, a majority of the customers who will ever default have already done so. This relies on the empirical observation that customers who default typically do so early in the life of a loan. From these cohort charts come the concepts of sample and performance windows.
For example, let’s imagine the typical amount of time a customer takes to default on a loan is 14 months. If our analysis is to be performed in March of this year, we will select our sample from 12-16 months back. This gives us an average performance window of 14 months. An example of this is shown below.
Now that our data is set, we can move into truly preparing our variables for modeling.
Variable Grouping and Selection
Feature creation and selection is one of the most important pieces to any modeling process. It is no different for credit score modeling. Before selecting the variables, we need to transform them. Specifically in credit score modeling, we need to take our continuous variables and bin them into categorical versions.
Variable Grouping
Scorecards end up displaying only bins within each variable. There are two primary objectives when deciding how to bin the variables:
- Eliminate weak variables or those that do not conform to good business logic.
- Group the strongest variables’ attribute levels (values) in order to produce a model in the scorecard format.
Binning continuous variables helps simplify analysis. We no longer need to explain coefficients that imply some notion of constant effect or linearity; instead, the coefficients are just comparisons of categories. Binning also models non-linearities in an easily interpretable way, since we are not restricted to the linearity of continuous variables that some models assume. Outliers are also easily accounted for, as they are typically contained within the smallest or largest bin. Lastly, missing values are no longer a problem and do not need imputation; missing values can get their own bin, making all observations available to be modeled.
There are a variety of different approaches to statistically bin variables. We will focus on the two most popular ones here:
- Prebinning and Grouping of Bins
- Conditional Inference Trees
The first is prebinning the variables followed by grouping of these bins. Imagine you had a variable whose range is from 3 to 63. This approach would first break the variable into quantile bins; software typically uses anywhere from 20 to 100 equally sized quantiles for this initial step. From there, we use chi-square tests to compare each adjacent pair of bins. If two adjacent bins are statistically the same with respect to the target variable using two-by-two contingency table chi-square tests (Mantel-Haenszel, for example), then we combine the bins. We repeat this process until no more adjacent pairs of bins can be statistically combined. Below is a visual of this process.
The second common approach is through conditional inference trees. These are a variation on the common CART decision tree. CART methods for decision trees have an inherent bias - variables with more levels are more likely to be split on when splits use the Gini or entropy criterion. Conditional inference trees add an extra step to this process. They evaluate which variable is most significant first, then evaluate the best split of that continuous variable through the chi-square decision tree approach on that specific variable only, not all variables. They repeat this process until no more significant variables are left to be split. How does this apply to binning? When binning a continuous variable, we are predicting the target variable using only our one continuous variable in the conditional inference tree. It evaluates whether the variable is significant at predicting the target variable. If so, it finds the most statistically significant split using chi-square tests between each value of the continuous variable, comparing the two groups formed by each candidate split. After finding the most significant split you have two ranges of the continuous variable - one below the split and one above. The process repeats itself until the algorithm can no longer find significant splits, leading to the definition of your bins. Below is a visual of this process.
Cut-offs (or cut points) from the decision tree algorithms might be rather rough. Sometimes we override the automatically generated cut points to more closely conform to business rules. These overrides might make the bins suboptimal, but hopefully not too much to impact the analysis.
Imagine a similar scenario for linear regression. Suppose you had two models with the first model having a \(R^2_A = 0.8\) and the second model having an \(R^2_A = 0.78\). However, the second model made more intuitive business sense than the first. You would probably choose the second model willing to sacrifice a small amount of predictive power for a model that made more intuitive sense. The same can be thought of when slightly altering the bins from these two approaches described above.
Let’s see how each of our softwares approaches binning continuous variables!
R
The R package that you choose will determine the technique that is used for binning the continuous variables. The scorecard package more closely aligns with the SAS approach of prebinning the variable and combining the bins. The smbinning package, as shown below, uses the conditional inference tree approach.
The smbinning function inside the smbinning package is the primary function to bin continuous variables. Our data set has a variable bad that flags when an observation has a default. However, the smbinning function needs a variable that defines the people in our data set who did not have the event - those who did not default. Below we create this new good variable in our data set.
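The line of code that builds this indicator is not reproduced in the notes; a minimal sketch, assuming the accepted-applicants data frame is named accepts and contains the bad flag, is:

accepts$good <- abs(accepts$bad - 1)   # 1 = did not default, 0 = defaulted
table(accepts$good)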
##
## 0 1
## 1196 4641
We also need to make the categorical variables in our data set into factor variables in R so the function will not automatically assume they are numeric just because they have numerical values. We can do this with the as.factor function.
accepts$bankruptcy <- as.factor(accepts$bankruptcy)
accepts$used_ind <- as.factor(accepts$used_ind)
accepts$purpose <- as.factor(accepts$purpose)
Before any binning is done, we need to split our data into training and testing because the binning evaluates relationships between the target variable and the predictor variables. This is easily done in R using the sample function to sample row numbers. The size = option identifies the number of observations to be sampled; it was set to 75% of the number of rows in the dataset. From there we can just identify the sampled rows as the training set and the remaining rows as the testing set. The set.seed function is used to replicate the results.
set.seed(12345)
train_id <- sample(seq_len(nrow(accepts)), size = floor(0.75*nrow(accepts)))
train <- accepts[train_id, ]
test <- accepts[-train_id, ]
Now we are ready to bin our variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option, which defines the data frame for your data; the y = option, which defines the target variable by name; and the x = option, which defines the predictor variable to be binned by name.
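The call itself is not shown in the notes; a minimal sketch, storing the result in a hypothetical object named result_bureau, is:

result_bureau <- smbinning(df = train, y = "good", x = "bureau_score")
result_bureau$ivtable   # summary of the bins
result_bureau$cuts      # the split points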
## Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
## 1 <= 603 223 112 111 223 112 111 0.0509 0.5022
## 2 <= 662 1056 678 378 1279 790 489 0.2413 0.6420
## 3 <= 699 939 754 185 2218 1544 674 0.2145 0.8030
## 4 <= 717 514 440 74 2732 1984 748 0.1174 0.8560
## 5 <= 765 899 824 75 3631 2808 823 0.2054 0.9166
## 6 > 765 513 498 15 4144 3306 838 0.1172 0.9708
## 7 Missing 233 153 80 4377 3459 918 0.0532 0.6567
## 8 Total 4377 3459 918 NA NA NA 1.0000 0.7903
## BadRate Odds LnOdds WoE IV
## 1 0.4978 1.0090 0.0090 -1.3176 0.1167
## 2 0.3580 1.7937 0.5843 -0.7423 0.1602
## 3 0.1970 4.0757 1.4050 0.0785 0.0013
## 4 0.1440 5.9459 1.7827 0.4562 0.0213
## 5 0.0834 10.9867 2.3967 1.0701 0.1675
## 6 0.0292 33.2000 3.5025 2.1760 0.2777
## 7 0.3433 1.9125 0.6484 -0.6781 0.0291
## 8 0.2097 3.7680 1.3265 0.0000 0.7738
## [1] 603 662 699 717 765
The ivtable element contains a summary of the splits as well as some information regarding each split. Working from left to right, the columns represent the number of observations in each bin, the number of goods (non-defaulters) and bads (defaulters) in each bin, as well as the cumulative versions of all of the above. Next come the percentage of observations that are in the bin as well as the percentages of observations in the bin that are good and bad. Finally, the table lists the odds, natural log of the odds, weight of evidence (WoE), and information value component. These last few are explained in the next section below.
The cuts element in the smbinning object contains a vector of the actual split points for the bins.
The smbinning.plot function will make bar plots of some of the above metrics for each bin. Specifically, we can plot the percentage of observations in each bin as well as the percentage of observations in each bin that are good and bad using the option = "dist", option = "goodrate", and option = "badrate" options respectively. The sub = option makes a subtitle for each plot.
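The plotting calls themselves are not reproduced here; a minimal sketch, assuming the binning result is stored in result_bureau as above, is:

smbinning.plot(result_bureau, option = "dist", sub = "Bureau Score")
smbinning.plot(result_bureau, option = "goodrate", sub = "Bureau Score")
smbinning.plot(result_bureau, option = "badrate", sub = "Bureau Score")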
SAS
SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. However, before any binning is done, we need to split our data into training and testing because the binning evaluates relationships between the target variable and the predictor variables. This is easily done in SAS using the SURVEYSELECT procedure. The data = option identifies the data we are interested in splitting. The method = srs option specifies that we want simple random sampling to split our data into training and testing. The out = option names the dataset that has the flagged observations. The samprate = 0.75 option specifies the proportion of observations to be sampled; it was set to 75%. The outall option is key to keeping both the training and testing observations in the final data set.
From there we can just split the data using a DATA step. We create both the train and valid datasets using the if and else statements to put the selected observations (from PROC SURVEYSELECT) into the training set and the others into the testing set.
proc surveyselect data = public.accepts method = srs noprint
out=accepts_split seed=12345 samprate=0.75 outall;
run;
data train valid;
set accepts_split;
if Selected = 1 then output train;
else output valid;
run;
Now we are ready to bin our variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the data = option, which defines the dataset, and the method = tree option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 10 option specifies that SAS can combine these into at most 10 final bins. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) that we will discuss in the next section.
proc binning data = train method = tree(initbin = 100 maxnbins = 10);
target bad / level = int;
input bureau_score / level = int;
ods output BinDetails = bincuts VarTransInfo = bincount;
run;
Working from left to right, the columns represent the bin number, the lower and upper bounds for the bin, the width of the bin, the number of observations in each bin, and some summary statistics for each bin (mean, standard deviation, minimum, and maximum).
Weight of Evidence (WOE)
Weight of evidence (WOE) measures the strength of the attributes (bins) of a variable in separating events and non-events in a binary target variable. In credit scoring, that implies separating bad and good accounts respectively.
Weight of evidence is based on comparing the proportion of goods to bads at each bin level and is calculated as follows for each bin within a variable:
\[ WOE_i = \log(\frac{Dist. Good_i}{Dist.Bad_i}) \] The distribution of goods for each bin is the number of goods in that bin divided by the total number of goods across all bins. The distribution of bads for each bin is the number of bads in that bin divided by the total number of bads across all bins. An example is shown below:
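The original example table is not reproduced here, but the first bureau_score bin from the ivtable output shown earlier gives a concrete illustration: that bin holds 112 of the 3,459 goods and 111 of the 918 bads, so

\[ WOE_1 = \log\left(\frac{112/3459}{111/918}\right) = \log\left(\frac{0.0324}{0.1209}\right) \approx -1.32 \]

which matches the -1.3176 reported in the WoE column.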
WOE summarizes the separation between events and non-events (bads and goods) as shown in the following table:
For WOE we are looking for big differences in WOE between bins. Ideally, we would like to see monotonic increases for variables that have ordered bins. This isn’t always required as long as the WOE pattern in the bins makes business sense. However, if a variable’s bins go back and forth between positive and negative WOE values across bins, then the variable typically has trouble separating goods and bads. Graphically, the WOE values for all the bins in the bureau_score variable look as follows:
A WOE of approximately zero implies the percentage of non-events (goods) is approximately equal to the percentage of events (bads), so that bin doesn’t do a good job of separating events and non-events. A positive WOE implies the bin contains a higher concentration of non-events (goods), while a negative WOE implies the bin contains a higher concentration of events (bads).
One quick side note. WOE values can take a value of infinity or negative infinity when quasi-complete separation exists in a category (zero events or non-events). Some people adjust the WOE calculation to include a small smoothing parameter to make the numerator or denominator of the WOE calculation not equal to zero.
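One common form of this adjustment (shown only as an illustration; the size of the smoothing constant varies by implementation) adds a small constant such as 0.5 to the bin counts:

\[ WOE_i = \log\left(\frac{(n_{i,good} + 0.5)/N_{good}}{(n_{i,bad} + 0.5)/N_{bad}}\right) \]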
Let’s see how to get the weight of evidence values in each of our softwares!
R
The smbinning function inside the smbinning package is the primary function to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option, which defines the data frame for your data; the y = option, which defines the target variable by name; and the x = option, which defines the predictor variable to be binned by name.
As we previously saw, the ivtable element contains a summary of the splits as well as some information regarding each split, including weight of evidence.
Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
1 <= 603 223 112 111 223 112 111 0.0509 0.5022
2 <= 662 1056 678 378 1279 790 489 0.2413 0.6420
3 <= 699 939 754 185 2218 1544 674 0.2145 0.8030
4 <= 717 514 440 74 2732 1984 748 0.1174 0.8560
5 <= 765 899 824 75 3631 2808 823 0.2054 0.9166
6 > 765 513 498 15 4144 3306 838 0.1172 0.9708
7 Missing 233 153 80 4377 3459 918 0.0532 0.6567
8 Total 4377 3459 918 NA NA NA 1.0000 0.7903
BadRate Odds LnOdds WoE IV
1 0.4978 1.0090 0.0090 -1.3176 0.1167
2 0.3580 1.7937 0.5843 -0.7423 0.1602
3 0.1970 4.0757 1.4050 0.0785 0.0013
4 0.1440 5.9459 1.7827 0.4562 0.0213
5 0.0834 10.9867 2.3967 1.0701 0.1675
6 0.0292 33.2000 3.5025 2.1760 0.2777
7 0.3433 1.9125 0.6484 -0.6781 0.0291
8 0.2097 3.7680 1.3265 0.0000 0.7738
The weight of evidence values are listed in the WoE column and are the same values as shown above. We can easily get a plot of the WOE values using the smbinning.plot function with the option = "WoE" option. The resulting plot is the same as the WOE plot above.
You can get weight of evidence values for a factor variable as well without needing to rebin the values. This is done using the smbinning.factor function on the purpose variable as shown below:
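The call is not reproduced in the notes; a minimal sketch, storing the result in a hypothetical object named result_purpose, is:

result_purpose <- smbinning.factor(df = train, y = "good", x = "purpose")
result_purpose$ivtable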
Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec
1 = 'LEASE' 1466 1149 317 1466 1149 317 0.3349
2 = 'LOAN' 2911 2310 601 4377 3459 918 0.6651
3 Missing 0 0 0 4377 3459 918 0.0000
4 Total 4377 3459 918 NA NA NA 1.0000
GoodRate BadRate Odds LnOdds WoE IV
1 0.7838 0.2162 3.6246 1.2877 -0.0388 0.0005
2 0.7935 0.2065 3.8436 1.3464 0.0199 0.0003
3 NaN NaN NaN NaN NaN NaN
4 0.7903 0.2097 3.7680 1.3265 0.0000 0.0008
SAS
SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the data = option, which defines the dataset, and the method = tree option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 10 option specifies that SAS can combine these into at most 10 final bins. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) as we need to output the specific points of the splits for the bins using the BinDetails = option and the number of bins using the VarTransInfo = option.
From there we use a DATA step to create a MACRO variable called numbin in SAS that defines the number of bins using the CALL SYMPUT functionality. To get a dataset that contains the values where the bins separate, we use PROC SQL to select the variable Max from the bincuts dataset created by PROC BINNING. We place the values of the Max variable into a MACRO variable called cuts.
Lastly, we reuse PROC BINNING. We can calculate the WOE values using the woe option. However, the woe option can only be used when the bins are defined by the user, which is why we needed the optimal bins calculated first before getting the WOE values. The only difference for the second instance of PROC BINNING is defining the value of the target variable that is an event in the TARGET statement using the event = option.
proc binning data = train method = tree(initbin = 100 maxnbins = 10);
target bad / level = int;
input bureau_score / level = int;
ods output BinDetails = bincuts VarTransInfo = bincount;
run;
data _null_;
set bincount;
call symput('numbin', Nbins - 1);
run;
proc sql;
select Max
into :cuts separated by ' '
from bincuts(firstobs = 2 obs = &numbin);
quit;
proc binning data = train numbin = &numbin method=cutpts(&cuts) woe;
target bad / event = '1';
input bureau_score / level = int;
run;
You can get weight of evidence values for a factor variable as well without needing to rebin the values. Here we calculate the WOE values ourselves using the TABULATE and TRANSPOSE procedures on the purpose variable as shown below:
proc tabulate data=public.train out=facwoe;
class bad purpose;
table purpose, bad*colpctn / rts=10;
run;
proc transpose data = facwoe out = facwoe2(rename=(col1 = bad0 col2 = bad1));
var PctN_10;
by purpose;
run;
data facwoe2;
set facwoe2;
WOE = log(bad1/bad0);
run;
proc print data = facwoe2;
run;
Information Value
Weight of evidence summarizes the individual categories or bins of a variable. However, we need a measure of how well all the categories in a variable do at separating the events from non-events. That is what information value (IV) is for. IV uses the WOE from each category as a piece of its calculation:
\[ IV = \sum_{i=1}^L (Dist.Good_i - Dist.Bad_i)\times \log\left(\frac{Dist.Good_i}{Dist.Bad_i}\right) \] In credit modeling, IV is used in some instances to actually select which variables belong in the model. Here are some typical IV ranges for determining the strength of a predictor variable at predicting the target variable:
- \(0 \le IV < 0.02\) - Not a predictor
- \(0.02 \le IV < 0.1\) - Weak predictor
- \(0.1 \le IV < 0.25\) - Moderate (medium) predictor
- \(0.25 \le IV\) - Strong predictor
Variables with information values greater than 0.1 are typically used in credit modeling.
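As a check on the 0.7738 reported for bureau_score, here is a minimal sketch of the IV calculation from the bin counts in the ivtable output above (the counts are copied from that table):

cnt_good <- c(112, 678, 754, 440, 824, 498, 153)   # goods per bin of bureau_score
cnt_bad  <- c(111, 378, 185, 74, 75, 15, 80)       # bads per bin of bureau_score
dist_good <- cnt_good / sum(cnt_good)
dist_bad  <- cnt_bad / sum(cnt_bad)
woe <- log(dist_good / dist_bad)
sum((dist_good - dist_bad) * woe)                  # approximately 0.774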
Some resources will say that IV values greater than 0.5 might signal over-predicting of the target. In other words, maybe the variable is too strong of a predictor because of how that variable was originally constructed. For example, if all previous loan decisions have been made only on bureau score, then of course that variable would be highly predictive and possibly the only significant variable. In these situations, good practice is to build one model with only bureau score and another model without bureau score but with other important factors. We then ensemble these models together.
Let’s see how to get the information values for our variables in each of our softwares!
R
The smbinning function inside the smbinning package is the primary function to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option, which defines the data frame for your data; the y = option, which defines the target variable by name; and the x = option, which defines the predictor variable to be binned by name.
As we previously saw, the ivtable element contains a summary of the splits as well as some information regarding each split, including information value.
Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
1 <= 603 223 112 111 223 112 111 0.0509 0.5022
2 <= 662 1056 678 378 1279 790 489 0.2413 0.6420
3 <= 699 939 754 185 2218 1544 674 0.2145 0.8030
4 <= 717 514 440 74 2732 1984 748 0.1174 0.8560
5 <= 765 899 824 75 3631 2808 823 0.2054 0.9166
6 > 765 513 498 15 4144 3306 838 0.1172 0.9708
7 Missing 233 153 80 4377 3459 918 0.0532 0.6567
8 Total 4377 3459 918 NA NA NA 1.0000 0.7903
BadRate Odds LnOdds WoE IV
1 0.4978 1.0090 0.0090 -1.3176 0.1167
2 0.3580 1.7937 0.5843 -0.7423 0.1602
3 0.1970 4.0757 1.4050 0.0785 0.0013
4 0.1440 5.9459 1.7827 0.4562 0.0213
5 0.0834 10.9867 2.3967 1.0701 0.1675
6 0.0292 33.2000 3.5025 2.1760 0.2777
7 0.3433 1.9125 0.6484 -0.6781 0.0291
8 0.2097 3.7680 1.3265 0.0000 0.7738
The information value is listed in the IV column and the last row. The IV numbers in each of the rows for the bins are the components of the IV from each of the categories. The final row is the sum of the previous rows, which is the overall variable IV.
Another way to view the information value for every variable in the dataset is to use the smbinning.sumiv function. The only two inputs to this function are the data = option, where you define the dataset, and the y = option, which defines the target variable. The function then calculates the IV of each variable in the dataset with the target variable.
To view these information values for each variable we can just print out a table of the results by calling the object by name. We can also use the smbinning.sumiv.plot function on the object to view them in a plot:
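The calls are not shown in the notes; a minimal sketch is below (the name of the data-frame argument is an assumption and may differ by package version):

iv_summary <- smbinning.sumiv(df = train, y = "good")
iv_summary                        # table of IV by variable
smbinning.sumiv.plot(iv_summary)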
Char IV Process
11 bureau_score 0.7738 Numeric binning OK
9 tot_rev_line 0.3987 Numeric binning OK
10 rev_util 0.3007 Numeric binning OK
5 age_oldest_tr 0.2512 Numeric binning OK
3 tot_derog 0.2443 Numeric binning OK
18 ltv 0.1456 Numeric binning OK
4 tot_tr 0.1304 Numeric binning OK
14 down_pyt 0.0848 Numeric binning OK
8 tot_rev_debt 0.0782 Numeric binning OK
19 tot_income 0.0512 Numeric binning OK
16 loan_term 0.0496 Numeric binning OK
13 msrp 0.0360 Numeric binning OK
12 purch_price 0.0204 Numeric binning OK
20 used_ind 0.0183 Factor binning OK
1 bankruptcy 0.0009 Factor binning OK
15 purpose 0.0008 Factor binning OK
2 app_id NA No significant splits
6 tot_open_tr NA No significant splits
7 tot_rev_tr NA No significant splits
17 loan_amt NA No significant splits
21 bad NA Uniques values < 5
22 weight NA Uniques values < 5
As we can see from the output above, the strong predictors of default are bureau_score, tot_rev_line, and rev_util. The moderate or medium predictors are age_oldest_tr, tot_derog, ltv, and tot_tr. These would be the variables typically used in credit modeling due to having IV scores above 0.1.
SAS
SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the data = option, which defines the dataset, and the method = tree option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 10 option specifies that SAS can combine these into at most 10 final bins. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) as we need to output the specific points of the splits for the bins using the BinDetails = option and the number of bins using the VarTransInfo = option.
From there we use a DATA step to create a MACRO variable called numbin in SAS that defines the number of bins using the CALL SYMPUT functionality. To get a dataset that contains the values where the bins separate, we use PROC SQL to select the variable Max from the bincuts dataset created by PROC BINNING. We place the values of the Max variable into a MACRO variable called cuts.
Lastly, we reuse PROC BINNING. We can calculate the WOE values and information value using the woe option. However, the woe option can only be used when the bins are defined by the user, which is why we needed the optimal bins calculated first before getting the WOE values. The only difference for the second instance of PROC BINNING is defining the value of the target variable that is an event in the TARGET statement using the event = option.
proc binning data = train method = tree(initbin = 100 maxnbins = 10);
target bad / level = int;
input bureau_score / level = int;
ods output BinDetails = bincuts VarTransInfo = bincount;
run;
data _null_;
set bincount;
call symput('numbin', Nbins - 1);
run;
proc sql;
select Max
into :cuts separated by ' '
from bincuts(firstobs = 2 obs = &numbin);
quit;
proc binning data = train numbin = &numbin method=cutpts(&cuts) woe;
target bad / event = '1';
input bureau_score / level = int;
run;
Gini Statistic
The Gini statistic is an optional technique that tries to answer the same question as information value - which variables are strong enough to enter the scorecard model. Since information value is more in line with weight of evidence calculations it is used much more often in practice.
The Gini statistic ranges between 0 and 100 where bigger values are better. A majority of the time the Gini statistic and IV will agree on variable importance, but might differ on borderline cases. The more complicated technique is calculated as follows:
\[ Gini = \left(1 - \frac{2\sum_{i=2}^L\left(n_{i,event} \times \sum_{j=1}^{i-1}n_{j,non-event}\right)+\sum_{i=1}^L\left(n_{i,event} \times n_{i,non-event}\right)}{N_{event}\times N_{non-event}}\right) \times 100 \]
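As an illustration of this formula, here is a minimal sketch in R using the bureau_score bin counts from the earlier ivtable (events are the bads, non-events the goods; the bins are taken in the order they appear in that table):

n_event    <- c(111, 378, 185, 74, 75, 15, 80)      # bads per bin
n_nonevent <- c(112, 678, 754, 440, 824, 498, 153)  # goods per bin
L_bins <- length(n_event)
cross_term  <- sum(sapply(2:L_bins, function(i) n_event[i] * sum(n_nonevent[1:(i - 1)])))
within_term <- sum(n_event * n_nonevent)
(1 - (2 * cross_term + within_term) / (sum(n_event) * sum(n_nonevent))) * 100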
Scorecard Creation
Now that we have transformed our variables for modeling, we can start with the process of building our model. In building credit models, we first build an initial credit scoring model. From there we will incorporate our rejected customers through reject inference to build our full, final credit score model.
Initial Scorecard Creation
In each of the models that we build we must take the following three steps:
- Build the model
- Evaluate the model
- Convert the model to a scorecard
Building the Model
The scorecard is typically based on a logistic regression model:
\[ logit(p) = \log(\frac{p}{1-p}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \] In the above equation, \(p\) is the probability of default given the inputs in the model. However, instead of using the original variables for the model, credit scoring models and their complementing scorecards are built on binned variables. Instead of treating the binned variables as categorical, the values of the bins are replaced by the WOE values for the categories. In other words, if a person has a bureau score of 705, then their observation takes the value of 0.46 as seen in the table above. These inputs are still treated as continuous even though they only take a limited number of values. Additionally, the variables are all on the same scale. You can think of the transformation we performed as scaling all the variables based on their predictive ability for the target variable.
Since all the variables are on the same scale, the \(\beta\) coefficients from the logistic regression model now serve as variable importance measures. These coefficients are actually the only thing we desire to gain from the logistic regression as they help define the point schemes for the scorecard.
Let’s see how to build our model in our software!
R
Since most credit models are built off variables with information values of at least 0.1, the following R code takes the continuous variables in your dataset, uses the smbinning function to bin the variables, and stores the results in a list called result_all_sig only for variables with \(IV \ge 0.1\).
The smbinning.gen function will create binned, factor variables in R based on the results from the smbinning function. The df = option defines the dataset. The ivout = option defines the specific results for the variable of interest, as you can only apply this function to one variable at a time. The chrname = option defines the name of the new binned variable in your dataset. The following example uses the bureau_score variable.
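That code is not reproduced at this point in the notes; a minimal sketch follows. The explicit list of candidate variables and the way the total IV is pulled from the ivtable are assumptions, and the list is limited here to the variables already known to clear the 0.1 threshold:

# Bin each candidate variable and keep the result only if its total IV is at least 0.1
num_vars <- c("tot_derog", "tot_tr", "age_oldest_tr", "tot_rev_line",
              "rev_util", "bureau_score", "ltv")
result_all_sig <- list()
for (v in num_vars) {
  check_res <- smbinning(df = train, y = "good", x = v)
  # smbinning returns a character message rather than a list when no significant splits exist
  if (is.list(check_res)) {
    total_iv <- check_res$ivtable[nrow(check_res$ivtable), "IV"]
    if (total_iv >= 0.1) result_all_sig[[v]] <- check_res
  }
}

# Create the binned factor version of bureau_score in the training data
train <- smbinning.gen(df = train, ivout = result_all_sig$bureau_score,
                       chrname = "bureau_score_bin")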
To create a variable that has WOE values instead of the binned values from the smbinning function, you need to use a loop. The following R code is an example for bureau_score that takes the newly created binned variable from smbinning.gen and creates a new variable where the WOE values are used for each observation instead of the binned values.
for (i in 1:nrow(train)) {
  bin_name <- "bureau_score_bin"
  bin <- substr(train[[bin_name]][i], 2, 2)     # second character of the bin label is the bin number
  woe_name <- "bureau_score_WOE"
  if(bin == 0) {
    # a bin number of 0 marks the missing bin, which sits in the next-to-last row of the ivtable
    bin <- dim(result_all_sig$bureau_score$ivtable)[1] - 1
    train[[woe_name]][i] <- result_all_sig$bureau_score$ivtable[bin, "WoE"]
  } else {
    train[[woe_name]][i] <- result_all_sig$bureau_score$ivtable[bin, "WoE"]
  }
}
We would then repeat the process for all of our variables in the dataset. The following R code loops through all the variables in the result_all_sig object and does just that.
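That looping code is not reproduced here, but it mirrors the test-set version shown later in these notes; a sketch for the training data, assuming the smbinning.gen step has already been repeated so that every _bin column exists, is:

for (j in 1:length(result_all_sig)) {
  for (i in 1:nrow(train)) {
    bin_name <- paste(result_all_sig[[j]]$x, "_bin", sep = "")
    bin <- substr(train[[bin_name]][i], 2, 2)
    woe_name <- paste(result_all_sig[[j]]$x, "_WOE", sep = "")
    if(bin == 0) {
      bin <- dim(result_all_sig[[j]]$ivtable)[1] - 1
      train[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
    } else {
      train[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
    }
  }
}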
Now that we have created our variables that are replaced with their WOE values, we can build our logistic regression model. The glm function in R provides the ability to model binary logistic regressions. Similar to most modeling functions in R, you can specify a model formula. The family = binomial(link = "logit") option specifies that we are building a logistic model. Generalized linear models (GLM) are a general class of models, and logistic regression is the special case where the link function is the logit function. Use the summary function to look at the necessary output.
initial_score <- glm(data = train, bad ~ tot_derog_WOE +
tot_tr_WOE +
age_oldest_tr_WOE +
tot_rev_line_WOE +
rev_util_WOE +
bureau_score_WOE +
ltv_WOE,
weights = train$weight, family = "binomial")
summary(initial_score)
Call:
glm(formula = bad ~ tot_derog_WOE + tot_tr_WOE + age_oldest_tr_WOE +
tot_rev_line_WOE + rev_util_WOE + bureau_score_WOE + ltv_WOE,
family = "binomial", data = train, weights = train$weight)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6982 -0.7675 -0.4499 -0.1756 3.3564
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.90953 0.03942 -73.813 < 0.0000000000000002 ***
tot_derog_WOE -0.10407 0.08130 -1.280 0.2005
tot_tr_WOE -0.03890 0.13225 -0.294 0.7687
age_oldest_tr_WOE -0.38879 0.09640 -4.033 0.0000551 ***
tot_rev_line_WOE -0.33535 0.08381 -4.001 0.0000631 ***
rev_util_WOE -0.18548 0.08020 -2.313 0.0207 *
bureau_score_WOE -0.81875 0.05659 -14.468 < 0.0000000000000002 ***
ltv_WOE -0.93623 0.10052 -9.314 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7182.6 on 4376 degrees of freedom
Residual deviance: 6363.4 on 4369 degrees of freedom
AIC: 6470.9
Number of Fisher Scoring iterations: 6
Let’s examine the output above. Scanning down the output, you can see the actual logistic regression equation itself for each of the variables. Again, credit models are typically built with all variables that have information values of at least 0.1 regardless of their significance in the model. However, at a reasonable significance level (we used a 0.005 significance level for this analysis based on the sample size), it appears that the variables tot_derog_WOE, tot_tr_WOE, and rev_util_WOE are not significant. You can easily perform variable selection based on other metrics like BIC, significance level, etc.
Model Evaluation
Credit models are evaluated like most classification models. Overall model performance is typically evaluated with the area under the ROC curve (AUC) as well as the K-S statistic.
Let’s see how to perform this in our software!
R
Luckily, the smbinning package has great functionality for evaluating model performance. The smbinning.metrics function provides many summary statistics and plots to evaluate our models. First, we must get the predictions from our model by creating a new variable pred in our dataset from the fitted.values element of our glm model object. This new pred variable is one of the inputs of the smbinning.metrics function. The dataset = option defines our dataset. The prediction = option is where we define the variable in the dataset with the predictions from our model. The actualclass = option defines the target variable from our dataset. The report = 1 option prints out a report with a variety of summary statistics as shown below:
train$pred <- initial_score$fitted.values
smbinning.metrics(dataset = train, prediction = "pred", actualclass = "bad", report = 1)
Overall Performance Metrics
--------------------------------------------------
KS : 0.4063 (Good)
AUC : 0.7638 (Fair)
Classification Matrix
--------------------------------------------------
Cutoff (>=) : 0.0617 (Optimal)
True Positives (TP) : 653
False Positives (FP) : 1055
False Negatives (FN) : 265
True Negatives (TN) : 2404
Total Positives (P) : 918
Total Negatives (N) : 3459
Business/Performance Metrics
--------------------------------------------------
%Records>=Cutoff : 0.3902
Good Rate : 0.3823 (Vs 0.2097 Overall)
Bad Rate : 0.6177 (Vs 0.7903 Overall)
Accuracy (ACC) : 0.6984
Sensitivity (TPR) : 0.7113
False Neg. Rate (FNR) : 0.2887
False Pos. Rate (FPR) : 0.3050
Specificity (TNR) : 0.6950
Precision (PPV) : 0.3823
False Discovery Rate : 0.6177
False Omision Rate : 0.0993
Inv. Precision (NPV) : 0.9007
Note: 0 rows deleted due to missing data.
The report provides multiple pieces of model evaluation. At the top it provides the KS and AUC metrics for the model. Next, the report summarizes metrics from the classification matrix. At the top of this section it provides the optimal cut-off level based on the Youden J statistic. At this cut-off it provides the number of true positives, false positives, true negatives, false negatives, total positives, and total negatives. The last section of the report provides many business performance metrics such as sensitivity, specificity, precision, and many more as seen above.
By using the plot = option in the smbinning.metrics function you can plot either the KS plot or the ROC curve.
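A sketch of those calls is below; the option values "ks" and "auc" are assumptions about the package’s accepted values:

smbinning.metrics(dataset = train, prediction = "pred", actualclass = "bad", plot = "ks")
smbinning.metrics(dataset = train, prediction = "pred", actualclass = "bad", plot = "auc")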
We can perform the same evaluation of our initial model on the testing dataset as well. We need to create our WOE variables in the testing dataset, which is easy to do with the smbinning.gen function on the test dataset. Remember, we are just scoring the test dataset, so we do not want to build new bins, just create the same ones from our training in the test set. By using the same looping process as above we can create our variables. We then use the predict function on the test dataset to get the predictions. The same smbinning.metrics function is used to graph and report metrics for the testing set predictions.
for(i in 1:length(result_all_sig)) {
  test <- smbinning.gen(df = test, ivout = result_all_sig[[i]], chrname = paste(result_all_sig[[i]]$x, "_bin", sep = ""))
}

for (j in 1:length(result_all_sig)) {
  for (i in 1:nrow(test)) {
    bin_name <- paste(result_all_sig[[j]]$x, "_bin", sep = "")
    bin <- substr(test[[bin_name]][i], 2, 2)
    woe_name <- paste(result_all_sig[[j]]$x, "_WOE", sep = "")
    if(bin == 0) {
      bin <- dim(result_all_sig[[j]]$ivtable)[1] - 1
      test[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
    } else {
      test[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
    }
  }
}
test$pred <- predict(initial_score, newdata=test, type='response')
smbinning.metrics(dataset = test, prediction = "pred", actualclass = "bad", report = 1)
Overall Performance Metrics
--------------------------------------------------
KS : 0.4589 (Good)
AUC : 0.7798 (Fair)
Classification Matrix
--------------------------------------------------
Cutoff (>=) : 0.0577 (Optimal)
True Positives (TP) : 216
False Positives (FP) : 376
False Negatives (FN) : 62
True Negatives (TN) : 806
Total Positives (P) : 278
Total Negatives (N) : 1182
Business/Performance Metrics
--------------------------------------------------
%Records>=Cutoff : 0.4055
Good Rate : 0.3649 (Vs 0.1904 Overall)
Bad Rate : 0.6351 (Vs 0.8096 Overall)
Accuracy (ACC) : 0.7000
Sensitivity (TPR) : 0.7770
False Neg. Rate (FNR) : 0.2230
False Pos. Rate (FPR) : 0.3181
Specificity (TNR) : 0.6819
Precision (PPV) : 0.3649
False Discovery Rate : 0.6351
False Omision Rate : 0.0714
Inv. Precision (NPV) : 0.9286
Note: 0 rows deleted due to missing data.
Scaling the Scorecard
The last step of the credit modeling process is building the scorecard itself. To create the scorecard we need to relate the predicted odds from our logistic regression model to the scorecard. The relationship between the odds and scores is represented by a linear function:
\[ Score = Offset + Factor \times \log(odds) \]
All that we need to define is the number of points it takes to double the odds (called the PDO) and a corresponding score at a chosen odds level. From there we have the following extra equation:
\[ Score + PDO = Offset + Factor \times \log(2 \times odds) \]
Through some basic algebra, the solution to the \(Factor\) and \(Offset\) is shown to be:
\[ Factor = \frac{PDO}{\log(2)} \]
\[ Offset = Score - Factor \times \log(odds) \]
For example, suppose a scorecard is scaled so that a score of 600 points corresponds to odds of 50:1 with a PDO of 20. Through the above equations we calculate \(Factor = 28.85\) and \(Offset = 487.12\). Therefore, the corresponding score for each predicted odds from the logistic regression model is calculated as:
\[ Score = 487.12 + 28.85\times \log(odds) \]
For this example, we would then calculate the score for each individual in our dataset. Notice how the above equation has the \(\log(odds)\), which is the prediction from a logistic regression model, \(\log(odds) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots\). This is one of the reasons logistic regression is still very popular in the credit modeling world.
The next step in the scorecard is to allocate the scorecard points to each of the categories (bins) in each of the variables. The points that are allocated to the \(i^{th}\) bin of variable \(j\) are computed as follows:
\[ Points_{i,j} = -(WOE_{i,j} \times \hat{\beta}_j + \frac{\hat{\beta}_0}{L}) \times Factor + \frac{Offset}{L} \] The \(WOE_{i,j}\) is the weight of evidence of the \(i^{th}\) bin of variable \(j\). The coefficient of the variable \(j\), \(\hat{\beta}_j\), as well as the intercept \(\hat{\beta}_0\), come from the logistic regression model. \(L\) is the number of variables in the model. With the \(Factor\) and \(Offset\) defined above as well as the bureau_score coefficient of \(-0.81875\) and intercept of \(-2.90953\) we calculate the points for each category of bureau_score as follows:
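As an illustration with the scaling above (\(Factor = 28.85\), \(Offset = 487.12\), and \(L = 7\) variables in the model), the lowest bureau_score bin, whose WOE is \(-1.3176\), would receive

\[ Points = -\left((-1.3176)(-0.81875) + \frac{-2.90953}{7}\right) \times 28.85 + \frac{487.12}{7} \approx -19.1 + 69.6 \approx 50 \]

points, with the other bins computed the same way.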
The other variables in the dataset would go through a similar process to build out the full scorecard.
Let’s see how to do this in our software!
R
Since we have variables with WOE values for each variable in the dataset, allocating the points to each category of the variable is easy to do. We just use a for loop to perform the above calculations for each variable:
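The loop itself is not reproduced in the notes; a minimal sketch, assuming the same scaling as above (PDO of 20 and 600 points at 50:1 odds) and the naming convention of appending _points to each variable, is:

pdo <- 20; base_score <- 600; base_odds <- 50
fact <- pdo / log(2)                        # Factor, about 28.85
os <- base_score - fact * log(base_odds)    # Offset, about 487.12

beta <- coef(initial_score)
beta0 <- beta["(Intercept)"]
woe_vars <- names(beta)[-1]                 # the *_WOE variables in the model
L <- length(woe_vars)

for (v in woe_vars) {
  points_name <- paste0(sub("_WOE$", "", v), "_points")
  train[[points_name]] <- -(train[[v]] * beta[v] + beta0 / L) * fact + os / L
}

point_cols <- paste0(sub("_WOE$", "", woe_vars), "_points")
train$Score <- rowSums(train[, point_cols])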
The above code also calculates a Score variable for our training data set that sums up all the points for each observation. We can do this same process for the testing dataset as well.
Lastly, we can view the distribution of all of the scorecard values for each observation in our dataset across both the training and testing datasets. This gives us a visual of the range of values to expect from our scorecard model.
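The plot is not reproduced here; a simple sketch of overlaying the two distributions, assuming Score has been computed on both train and test, is:

hist(train$Score, breaks = 25, col = rgb(0, 0, 1, 0.4),
     main = "Distribution of Scorecard Scores", xlab = "Score")
hist(test$Score, breaks = 25, col = rgb(1, 0, 0, 0.4), add = TRUE)
legend("topright", legend = c("Training", "Testing"),
       fill = c(rgb(0, 0, 1, 0.4), rgb(1, 0, 0, 0.4)))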
Reject Inference
The previous scorecard that we have built is a behavioral scorecard because it models the behavior of current customers. However, it doesn’t fully capture the effects of applicants because it is only based on current customers who had approved applications. We still have applicants who were rejected for loans. Reject inference is the process of inferring the status of the rejected applicants based on the accepted applicants’ model (the behavioral model) in an attempt to use their information to build a scorecard that is representative of the entire applicant population. Reject inference is about solving sample bias so that the development sample is similar to the population to which the scorecard will be applied. Scorecards using reject inference are referred to as application scorecards since they more closely mimic the “through-the-door” population. Reject inference also helps comply with regulatory requirements like the ones provided by the FDIC and Basel Accords.
There are three common techniques for reject inference:
- Hard Cut-off Augmentation
- Parceling Augmentation
- Fuzzy Augmentation
Hard Cut-off Augmentation
The hard cut-off augmentation essentially scores all the rejected individuals using the behavioral scorecard model and infers whether the rejected individuals would have defaulted based on some predetermined cut-off score. The following are the steps to perform the hard cut-off augmentation method for reject inference:
- Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
- Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
- If the rejected applicants to accepted applicants ratio doesn’t match the population ratio, then create weighted cases for the rejected applicants. Similar to rare event modeling in classification models, we want to adjust the number of sampled rejects in comparison to our sampled accepts to accurately reflect the number of rejects in comparison to accepts from the population.
- Set a cut-off score above which an applicant is deemed a non-defaulter and below which an applicant is deemed a defaulter.
- Add the inferred defaulters and non-defaulters to the known defaulters and non-defaulters and rebuild the scorecard.
Let’s see how to do this in our software!
R
Before scoring our rejected applicants, we need to perform the same data transformations on the rejects dataset as we did on the accepts dataset. The following R code generates the same bins for the rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant’s scorecard score.
Next, we just use the predict function to score the rejects dataset. The first input is the model object from our behavioral model. The newdata = option defines the rejects dataset that we need to score. The type = option specifies that we want the predicted probability of default for each observation in the rejects dataset. The next two lines of code create a bad and a good variable in the rejects dataset based on the optimal cut-off defined in the previous section. The next few lines calculate the new weight for the observations in our data set, accounting both for the rare event sampling and for the accepts-to-rejects ratio in the population. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model.
Parceling Augmentation
The parceling augmentation essentially scores all the rejected individuals using the behavioral scorecard model. However, instead of using a single cut-off, the parceling method splits the predicted scores into buckets (or parcels). The observations in these groups are randomly assigned to default or non-default based on that group’s rate of default in the accepts sample. The following are the steps to perform the parceling augmentation method for reject inference:
- Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
- Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
- If the rejected applicants to accepted applicants ratio doesn’t match the population ratio, then create weighted cases for the rejected applicants. Similar to rare event modeling in classification models, we want to adjust the number of sampled rejects in comparison to our sampled accepts to accurately reflect the number of rejects in comparison to accepts from the population.
- Define score ranges manually or automatically with simple bucketing.
- The inferred default status of the rejected applicants will be assigned randomly and proportionally to the number of defaulters and non-defaulters in the accepted sample within each score range.
- (OPTIONAL) If desired, apply an event rate increase factor to the probability of default for each bucket to increase the proportion of defaulters among the rejects.
- Add the inferred defaulters and non-defaulters back in with the known defaulters and non-defaulters and rebuild the scorecard.
The chart below goes through an example for a bucket between the scores of 655 and 665.
Let’s see how to do this in our software!
R
Before scoring our rejected applicants, we need to perform the same data transformations on the reject dataset as we did on the accepts dataset. The R code in the hard cut-off section generates the same bins for our rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant's scorecard score.
Next, we use the seq function to create buckets between 500 and 725 in groups of 25. We then use the cut function to split the scored observations from each of the accepts and rejects datasets into these buckets, and the table function to calculate the default rate of the accepts dataset in each bucket. We apply an optional event rate increase of 25% from the optional step 6 above. Next, we loop through each bucket and randomly assign defaulters and non-defaulters based on the accepts default rate in that bucket plus the added adjustment. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model.
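Below is a minimal sketch of this bucketing and random assignment, assuming the scored accepts and rejects live in data frames accepts_scored and rejects_scored with a Score column and a bad indicator on the accepts; the object names and the 25% event rate increase are illustrative assumptions rather than the exact original code. The two tables that follow show the resulting counts of non-defaulters and defaulters by bucket.

# Minimal sketch of parceling -- data frame and column names are assumptions
parc_breaks <- seq(500, 725, by = 25)                          # score buckets of width 25

accepts_scored$parc <- cut(accepts_scored$Score, breaks = parc_breaks)
rejects_scored$parc <- cut(rejects_scored$Score, breaks = parc_breaks)

# Default rate of the accepts within each bucket
parc_table <- table(accepts_scored$parc, accepts_scored$bad)
parc_rate  <- parc_table[, "1"] / rowSums(parc_table)

bump <- 1.25                                                   # optional 25% event rate increase

set.seed(12345)
for (b in levels(rejects_scored$parc)) {
  idx <- which(rejects_scored$parc == b)
  p   <- min(bump * parc_rate[b], 1)                           # adjusted default probability, capped at 1
  if (!length(idx) || is.na(p)) next                           # skip empty or unseen buckets
  rejects_scored$bad[idx] <- rbinom(length(idx), size = 1, prob = p)
}
rejects_scored$good <- 1 - rejects_scored$bad

# Combine the inferred rejects with the accepted applicants to rebuild the scorecard
keep <- intersect(names(accepts_scored), names(rejects_scored))
comb_parc <- rbind(accepts_scored[, keep], rejects_scored[, keep])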
Accepted applicants: counts of non-defaulters (0) and defaulters (1) by score bucket.

| Score bucket | Non-default (0) | Default (1) |
|--------------|-----------------|-------------|
| (500,525] | 46 | 66 |
| (525,550] | 630 | 478 |
| (550,575] | 1166 | 410 |
| (575,600] | 1161 | 178 |
| (600,625] | 840 | 52 |
| (625,650] | 555 | 12 |
| (650,675] | 243 | 0 |
| (675,700] | 0 | 0 |
| (700,725] | 0 | 0 |

Rejected applicants with inferred status: counts of non-defaulters (0) and defaulters (1) by score bucket.

| Score bucket | Non-default (0) | Default (1) |
|--------------|-----------------|-------------|
| (500,525] | 5 | 14 |
| (525,550] | 345 | 392 |
| (550,575] | 976 | 489 |
| (575,600] | 848 | 156 |
| (600,625] | 543 | 39 |
| (625,650] | 295 | 12 |
| (650,675] | 119 | 0 |
| (675,700] | 0 | 0 |
| (700,725] | 0 | 0 |
Fuzzy Augmentation
The fuzzy augmentation essentially scores all the rejected individuals using the behavioral scorecard model. It then creates two observations for each observation in the reject dataset. One observation is assigned as a defaulter, while the other is assigned as a non-defaulter. These observations are then weighted based on the probability of default from the behavioral scorecard. The following are the steps to perform the fuzzy augmentation method for reject inference:
- Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
- Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
- Do not assign a reject to default or non-default. Instead, create two weighted cases for each rejected applicant using the probability of default and the probability of non-default, respectively.
- Multiply the probability of default and the probability of non-default by the user-specified rejection rate to form frequency variables.
- For each rejected applicant, create two observations: one observation has a frequency variable of (rejection rate \(\times\) probability of default) and a target class of default; the other observation has a frequency variable of (rejection rate \(\times\) probability of non-default) and a target class of non-default.
- Add the inferred defaulters and non-defaulters back in with the known defaulters and non-defaulters and rebuild the scorecard.
Let’s see how to do this in our software!
R
Before scoring our rejected applicants, we need to perform the same data transformations on the reject dataset as we did on the accepts dataset. The R code in the hard cut-off section generates the same bins for our rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant's scorecard score.
Next, we use the predict function to score the reject dataset. The first input is the model object from our behavioral model, the newdata = option defines the reject dataset that we need to score, and the type = option specifies that we want the predicted probability of default for each observation in the reject dataset. The next two lines of code create a good version and a bad version of the reject dataset. In the non-defaulter version of the rejects dataset, the target variable is assigned to non-default for all observations and the weight is calculated as described above; the opposite is done for the defaulter version of the rejects dataset. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model.
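Below is a minimal sketch of this weighting scheme, assuming rejects_scored holds the binned and scored rejects, accepts_scored the scored accepts, and initial_score the behavioral model, with an assumed 25% rejection rate; the names and values are illustrative, not the original code.

# Minimal sketch of fuzzy augmentation -- names and the rejection rate are assumptions
rejects_scored$pred <- predict(initial_score, newdata = rejects_scored, type = "response")

rej_rate <- 0.25                                    # assumed user-specified rejection rate

# Non-defaulter copy of every reject, weighted by the probability of non-default
rejects_g <- rejects_scored
rejects_g$bad    <- 0
rejects_g$weight <- rej_rate * (1 - rejects_g$pred)

# Defaulter copy of every reject, weighted by the probability of default
rejects_b <- rejects_scored
rejects_b$bad    <- 1
rejects_b$weight <- rej_rate * rejects_b$pred

# Combine both weighted copies with the accepted applicants and rebuild the scorecard
keep <- intersect(names(accepts_scored), names(rejects_g))
comb_fuzz <- rbind(accepts_scored[, keep], rejects_g[, keep], rejects_b[, keep])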
Other Reject Inference Approaches
Although the three approaches above are the most widely accepted and used approaches, others have been proposed in the industry.
- Assign all rejects to default. This approach is only valid if the current process for accepting loans is extremely good at determining who will and will not default and also has a high acceptance rate (97% or higher). It is easy, but not recommended because of the bias it can introduce.
- Randomly assign rejects in the same proportion of defaulters and non-defaulters as reflected in the accepted applicant dataset. The problem here is that this approach implies our rejected applicants are the same as our accepted applicants, which would make the current acceptance process essentially random and ineffective.
- Similar in-house model on different data. If the rejected applicants have other loans with the institution, you could use their default probability from the other product’s behavioral scoring model.
- Approve all applicants for a certain period of time. Although this approach is unbiased in theory, it is rather impractical in reality as it would likely not pass regulatory scrutiny.
- Clustering algorithms (unsupervised learning) to group rejected applicants and accepted applicants into clusters. The rejects would then be randomly assigned a default status based on the default rate of the accepted applicants in the same cluster. This is a similar approach to parceling; a brief sketch of this idea follows this list.
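As a rough illustration of the clustering idea, the sketch below clusters weight-of-evidence variables from the combined accepts and rejects with k-means and assigns each reject a default status using the accepts' default rate within its cluster; the choice of k-means, the number of clusters, and the assumption that both datasets carry the listed WOE columns are all illustrative.

# Illustrative sketch of clustering-based reject inference -- names and k are assumptions
woe_vars <- c("bureau_score_WOE", "ltv_WOE", "rev_util_WOE")   # hypothetical shared WOE columns
X <- scale(rbind(accepts_scored[, woe_vars], rejects_scored[, woe_vars]))

set.seed(12345)
km <- kmeans(X, centers = 5)                                    # assumed number of clusters

n_acc <- nrow(accepts_scored)
accepts_scored$cluster <- km$cluster[1:n_acc]
rejects_scored$cluster <- km$cluster[(n_acc + 1):nrow(X)]

# Default rate of the accepts in each cluster
clus_rate <- tapply(accepts_scored$bad, accepts_scored$cluster, mean)

# Randomly assign rejects to default/non-default using their cluster's rate
rejects_scored$bad <- rbinom(nrow(rejects_scored), size = 1,
                             prob = clus_rate[as.character(rejects_scored$cluster)])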
Final Scorecard
Now that we have built the initial scorecard and accounted for our reject inference problem, we can move on to the building of the final application scorecard.
Building the Final Scorecard
The mechanics of building the final scorecard model are identical to the initial scorecard creation, except that the analysis is performed after reject inference.
Let’s see how to do this in our software!
R
The first line of code is a placeholder for whichever type of reject inference you used from above. Here, we are using the dataset created by the parceling augmentation method of reject inference.
Next, we go through all the normal steps of model building.
- Separate into training and testing datasets
- Evaluate variables based on information value
- Bin the variables with \(IV \ge 0.1\)
- Transform the variables into weight of evidence representations
- Build the logistic regression model
- Evaluate the regression model on training and testing datasets
- Allocate the points for the scorecard
Only certain pieces of the output are shown to keep the output to a minimum.
comb <- comb_parc
# Step 1. Separate into training and testing datasets #
set.seed(12345)
train_id <- sample(seq_len(nrow(comb)), size = floor(0.75*nrow(comb)))
train_comb <- comb[train_id, ]
test_comb <- comb[-train_id, ]
# Step 2. Evaluate variables based on information value #
iv_summary <- smbinning.sumiv(df = train_comb, y = "good")
# Step 3. Bin the variables with IV >= 0.1 #
num_names <- names(train_comb)[sapply(train_comb, is.numeric)]
result_all_sig <- list() # Creating empty list to store all results #
for(i in 1:length(num_names)){
check_res <- smbinning(df = train_comb, y = "good", x = num_names[i])
if(check_res == "Uniques values < 5") {
next
}
else if(check_res == "No significant splits") {
next
}
else if(check_res$iv < 0.1) {
next
}
else {
result_all_sig[[num_names[i]]] <- check_res
}
}
# Step 4. Transform the variables into weight of evidence representations #
for(i in 1:length(result_all_sig)) {
train_comb <- smbinning.gen(df = train_comb, ivout = result_all_sig[[i]], chrname = paste(result_all_sig[[i]]$x, "_bin", sep = ""))
}
for (j in 1:length(result_all_sig)) {
for (i in 1:nrow(train_comb)) {
bin_name <- paste(result_all_sig[[j]]$x, "_bin", sep = "")
bin <- substr(train_comb[[bin_name]][i], 2, 2)
woe_name <- paste(result_all_sig[[j]]$x, "_WOE", sep = "")
if(bin == 0) {
bin <- dim(result_all_sig[[j]]$ivtable)[1] - 1
train_comb[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
} else {
train_comb[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
}
}
}
# Step 5. Build the logistic regression model #
final_score <- glm(data = train_comb, bad ~ tot_tr_WOE +
tot_derog_WOE +
age_oldest_tr_WOE +
tot_rev_line_WOE +
rev_util_WOE +
bureau_score_WOE +
ltv_WOE
, weights = train_comb$weight_ar, family = "binomial")
summary(final_score)
Call:
glm(formula = bad ~ tot_tr_WOE + tot_derog_WOE + age_oldest_tr_WOE +
tot_rev_line_WOE + rev_util_WOE + bureau_score_WOE + ltv_WOE,
family = "binomial", data = train_comb, weights = train_comb$weight_ar)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1674 -0.9320 -0.5407 -0.1769 4.3996
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.82279 0.02445 -115.442 < 0.0000000000000002 ***
tot_tr_WOE 0.02077 0.08722 0.238 0.812
tot_derog_WOE -0.08350 0.06033 -1.384 0.166
age_oldest_tr_WOE -0.42120 0.05744 -7.333 0.000000000000225 ***
tot_rev_line_WOE -0.40164 0.05411 -7.422 0.000000000000115 ***
rev_util_WOE -0.21023 0.04947 -4.250 0.000021397754860 ***
bureau_score_WOE -0.79630 0.03797 -20.971 < 0.0000000000000002 ***
ltv_WOE -0.91261 0.07763 -11.755 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18052 on 7551 degrees of freedom
Residual deviance: 16105 on 7544 degrees of freedom
AIC: 17543
Number of Fisher Scoring iterations: 5
# Step 6. Evaluate the regression model on training and testing datasets #
train_comb$pred <- final_score$fitted.values
smbinning.metrics(dataset = train_comb, prediction = "pred", actualclass = "bad", report = 1)
Overall Performance Metrics
--------------------------------------------------
KS : 0.3934 (Fair)
AUC : 0.7555 (Fair)
Classification Matrix
--------------------------------------------------
Cutoff (>=) : 0.0591 (Optimal)
True Positives (TP) : 1309
False Positives (FP) : 2115
False Negatives (FN) : 421
True Negatives (TN) : 3707
Total Positives (P) : 1730
Total Negatives (N) : 5822
Business/Performance Metrics
--------------------------------------------------
%Records>=Cutoff : 0.4534
Good Rate : 0.3823 (Vs 0.2291 Overall)
Bad Rate : 0.6177 (Vs 0.7709 Overall)
Accuracy (ACC) : 0.6642
Sensitivity (TPR) : 0.7566
False Neg. Rate (FNR) : 0.2434
False Pos. Rate (FPR) : 0.3633
Specificity (TNR) : 0.6367
Precision (PPV) : 0.3823
False Discovery Rate : 0.6177
False Omision Rate : 0.1020
Inv. Precision (NPV) : 0.8980
Note: 0 rows deleted due to missing data.
for(i in 1:length(result_all_sig)) {
test_comb <- smbinning.gen(df = test_comb, ivout = result_all_sig[[i]], chrname = paste(result_all_sig[[i]]$x, "_bin", sep = ""))
}
for (j in 1:length(result_all_sig)) {
for (i in 1:nrow(test_comb)) {
bin_name <- paste(result_all_sig[[j]]$x, "_bin", sep = "")
bin <- substr(test_comb[[bin_name]][i], 2, 2)
woe_name <- paste(result_all_sig[[j]]$x, "_WOE", sep = "")
if(bin == 0) {
bin <- dim(result_all_sig[[j]]$ivtable)[1] - 1
test_comb[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
} else {
test_comb[[woe_name]][i] <- result_all_sig[[j]]$ivtable[bin, "WoE"]
}
}
}
test_comb$pred <- predict(final_score, newdata=test_comb, type='response')
smbinning.metrics(dataset = test_comb, prediction = "pred", actualclass = "bad", report = 1)
Overall Performance Metrics
--------------------------------------------------
KS : 0.3892 (Fair)
AUC : 0.7495 (Fair)
Classification Matrix
--------------------------------------------------
Cutoff (>=) : 0.0609 (Optimal)
True Positives (TP) : 427
False Positives (FP) : 707
False Negatives (FN) : 141
True Negatives (TN) : 1243
Total Positives (P) : 568
Total Negatives (N) : 1950
Business/Performance Metrics
--------------------------------------------------
%Records>=Cutoff : 0.4504
Good Rate : 0.3765 (Vs 0.2256 Overall)
Bad Rate : 0.6235 (Vs 0.7744 Overall)
Accuracy (ACC) : 0.6632
Sensitivity (TPR) : 0.7518
False Neg. Rate (FNR) : 0.2482
False Pos. Rate (FPR) : 0.3626
Specificity (TNR) : 0.6374
Precision (PPV) : 0.3765
False Discovery Rate : 0.6235
False Omision Rate : 0.1019
Inv. Precision (NPV) : 0.8981
Note: 0 rows deleted due to missing data.
# Step 7. Allocate the points for the scorecard #
pdo <- 20
score <- 600
odds <- 50
fact <- pdo/log(2)
os <- score - fact*log(odds)
var_names <- names(final_score$coefficients[-1])
for(i in var_names) {
beta <- final_score$coefficients[i]
beta0 <- final_score$coefficients["(Intercept)"]
nvar <- length(var_names)
WOE_var <- train_comb[[i]]
points_name <- paste(str_sub(i, end = -4), "points", sep="")
train_comb[[points_name]] <- -(WOE_var*(beta) + (beta0/nvar))*fact + os/nvar
}
colini <- (ncol(train_comb)-nvar + 1)
colend <- ncol(train_comb)
train_comb$Score <- rowSums(train_comb[, colini:colend])
for(i in var_names) {
beta <- final_score$coefficients[i]
beta0 <- final_score$coefficients["(Intercept)"]
nvar <- length(var_names)
WOE_var <- test_comb[[i]]
points_name <- paste(str_sub(i, end = -4), "points", sep="")
test_comb[[points_name]] <- -(WOE_var*(beta) + (beta0/nvar))*fact + os/nvar
}
colini <- (ncol(test_comb)-nvar + 1)
colend <- ncol(test_comb)
test_comb$Score <- rowSums(test_comb[, colini:colend])
accepts_scored_comb <- rbind(train_comb, test_comb)
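To make Step 7 easier to follow, the code above implements the usual scorecard scaling with pdo = 20, a target score of 600, and target odds of 50. Writing \(\text{factor} = pdo/\ln(2)\) and \(\text{offset} = \text{score} - \text{factor}\cdot\ln(\text{odds})\), the points assigned to variable \(j\) for a given applicant are

\[ \text{points}_j = -\left(WOE_j\,\hat{\beta}_j + \frac{\hat{\beta}_0}{n}\right)\cdot\text{factor} + \frac{\text{offset}}{n}, \]

where \(n\) is the number of variables in the model. The total scorecard score is the sum of these points across variables, which is what the rowSums call computes.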
Business Evaluation and Cut-off Selection
The above models have been evaluated with statistical metrics like KS and AUC. However, it is good to look at more business-type evaluations of the model as well. One of the primary ways to do this is to create a default decile plot. A default decile plot buckets the predicted scores from the scorecard into 10 equally sized (same number of applicants) buckets. From there, we evaluate the true default rate of the individuals in each bucket. If our model does a good job of separating the defaulters from the non-defaulters, we should see a consistent drop in default rate as the scores get higher as seen below:
| Score decile | Default rate (%) |
|--------------|------------------|
| [518,538] | 18.87 |
| (538,547] | 12.47 |
| (547,555] | 10.18 |
| (555,562] | 7.99 |
| (562,571] | 6.16 |
| (571,582] | 3.95 |
| (582,594] | 3.42 |
| (594,609] | 2.06 |
| (609,631] | 1.18 |
| (631,682] | 0.40 |
The next step is to decide on a decision cut-off value for the scorecard. Above this cut-off, an applicant is approved for a loan; below it, the applicant is rejected. The new scorecard should be better than the previous method in terms of at least one of the following:
- Lower default rate for the same approval rate
- Higher approval rate for the same default rate
- Highest amount of profit available
To address the first two points above we plot the acceptance rate by the default rate across different levels of cut-off (values of the scorecard) and compare. The interactive plot below shows an example of this:
We can move our cursor along the plots to see the default rate and acceptance rate combination at any scorecard cut-off. We can see that with our current acceptance rate of 70%, we have a lower default rate of approximately 2.9% at a cut-off score of 559. We could also use a cut-off of 537 to keep our default rate close to 5%, but raise our acceptance rate to 93.4%.
We could also balance the above choices with profit. Every time we make a correct decision and give a loan to an individual who pays us back, we make approximately $1,200 in profit on average. However, for every mistake where a customer defaults, we lose $50,000 on average. The number of people who get loans and then default is controlled by the cut-off. Similar to the plot above, we can plot the acceptance rate by the average profit across different levels of cut-off (values of the scorecard) and compare. The interactive plot below shows an example of this:
We can move our cursor along the plots to see the profit and acceptance rate combination at any scorecard cut-off. We can see that with our current acceptance rate of 70%, we are barely making a profit at a cut-off score of 559. We could also use a cut-off of 596 to maximize our profit, but our acceptance rate falls to roughly 33%.
From these two plots we can make a more informed decision on the cut-off for our model. A good strategy might be to use two cut-offs. The first cut-off would be 596: applicants scoring at or above it are automatically accepted to ensure we capture profitable customers. Below a cut-off of 537, the applicant is automatically rejected because the risk of default is beyond the bank's comfort. Applicants in the middle range can be sent to a team for further investigation into whether the loan is worth making.
Let’s see how to do this in software!
R
There are a variety of ways to generate the plots described in the previous section. The following code produces the plots you have seen above:
# Decile Default Plot #
cutpoints <- quantile(accepts_scored_comb$Score, probs = seq(0,1,0.10))
accepts_scored_comb$Score.QBin <- cut(accepts_scored_comb$Score, breaks=cutpoints, include.lowest=TRUE)
Default.QBin.pop <- round(table(accepts_scored_comb$Score.QBin, accepts_scored_comb$bad)[,2]/(table(accepts_scored_comb$Score.QBin, accepts_scored_comb$bad)[,2] + table(accepts_scored_comb$Score.QBin, accepts_scored_comb$bad)[,1]*4.75)*100,2)
print(Default.QBin.pop)
barplot(Default.QBin.pop,
main = "Default Decile Plot",
xlab = "Deciles of Scorecard",
ylab = "Default Rate (%)", ylim = c(0,20),
col = saturation(heat.colors, scalefac(0.8))(10))
abline(h = 5, lwd = 2, lty = "dashed")
text(11, 6, "Current = 5.00%")
# Calculations of Default, Acceptance Rate, and Profit by Cut-off Score #
def <- NULL
acc <- NULL
prof <- NULL
score <- NULL
cost <- 50000
profit <- 1500
for(i in min(floor(train_comb$Score)):max(floor(train_comb$Score))){
score[i - min(floor(train_comb$Score)) + 1] <- i
def[i - min(floor(train_comb$Score)) + 1] <- 100*sum(train_comb$bad[which(train_comb$Score >= i)])/(length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)]))
acc[i - min(floor(train_comb$Score)) + 1] <- 100*(length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)]))/(length(train_comb$bad[which(train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$bad == 0)]))
prof[i - min(floor(train_comb$Score)) + 1] <- length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)])*(-cost) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)])*profit
}
plot_data <- data.frame(def, acc, prof, score)
# Plot of Acceptance Rate by Default Rate #
ay1 <- list(
title = "Default Rate (%)",
range = c(0, 10)
)
ay2 <- list(
tickfont = list(),
range = c(0, 100),
overlaying = "y",
side = "right",
title = "Acceptance Rate (%)"
)
fig <- plot_ly()
fig <- fig %>% add_lines(x = ~score, y = ~def, name = "Default Rate (%)")
fig <- fig %>% add_lines(x = ~score, y = ~acc, name = "Acceptance Rate (%)", yaxis = "y2")
fig <- fig %>% layout(
title = "Default Rate by Acceptance Across Score", yaxis = ay1, yaxis2 = ay2,
xaxis = list(title="Scorecard Value"),
legend = list(x = 1.2, y = 0.8)
)
fig
# Plot of Acceptance Rate by Profit #
ay1 <- list(
title = "Profit ($)",
showline = FALSE,
showgrid = FALSE
)
ay2 <- list(
tickfont = list(),
range = c(0, 100),
overlaying = "y",
side = "right",
title = "Acceptance Rate (%)"
)
fig <- plot_ly()
fig <- fig %>% add_lines(x = ~score, y = ~prof, name = "Profit ($)")
fig <- fig %>% add_lines(x = ~score, y = ~acc, name = "Acceptance Rate (%)", yaxis = "y2")
fig <- fig %>% layout(
title = "Profit by Acceptance Across Score", yaxis = ay1, yaxis2 = ay2,
xaxis = list(title="Scorecard Value"),
legend = list(x = 1.2, y = 0.8)
)
fig
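As a final illustration, the two cut-off strategy described above could be applied to the combined scored data as in the sketch below; the decision labels and the decision column are illustrative choices, not part of the original code.

# Sketch of the two cut-off policy: auto-accept at 596 and above, auto-reject below 537,
# manual review in between (labels and column name are illustrative)
accepts_scored_comb$decision <- cut(accepts_scored_comb$Score,
                                    breaks = c(-Inf, 537, 596, Inf),
                                    labels = c("Auto Reject", "Manual Review", "Auto Accept"),
                                    right = FALSE)
table(accepts_scored_comb$decision)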
Credit Scoring Model Extensions
With the growth of machine learning models, the credit modeling industry is adapting their process on building scorecards. Below are some extensions that have been proposed in literature.
One extension to scorecard modeling is a multi-stage approach. Decision trees (and most tree-based algorithms) have the benefit of including interactions at every split of the tree. However, this also makes interpretation a little harder for scorecards, as you would have scorecard points changing at every branch of the tree. In the multi-stage approach, the first stage is to build a decision tree on the dataset to get an initial couple of layers of splits. The second stage is to build logistic regression based scorecards within each of the limited number of initial splits from the decision tree. The interpretation is then within a split (sub-group) of the dataset. A brief sketch of this idea follows.
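As a rough sketch of this multi-stage idea, one could use a shallow decision tree (for example, via the rpart package) to define a handful of segments and then fit a separate WOE-based logistic regression within each segment; the package, tree depth, and variable choices below are illustrative assumptions, not a prescribed implementation.

library(rpart)

# Stage 1: shallow tree to define a few interaction-driven segments (assumed depth of 2)
seg_tree <- rpart(bad ~ bureau_score + ltv + rev_util, data = train_comb,
                  method = "class", control = rpart.control(maxdepth = 2))
train_comb$segment <- factor(seg_tree$where)   # terminal node id for each observation

# Stage 2: a separate logistic regression scorecard within each segment
seg_models <- lapply(split(train_comb, train_comb$segment), function(d) {
  glm(bad ~ bureau_score_WOE + ltv_WOE + rev_util_WOE,
      data = d, weights = weight_ar, family = "binomial")
})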
For credit modeling, model interpretation is key. This makes machine learning algorithms hard to get past regulators. Scorecard layers on top of machine learning algorithms help drive the interpretation of the algorithm; however, regulators are still hesitant. That doesn't mean we cannot use these techniques. They are still very valuable for internal comparison and variable selection. For example, you could build a neural network, tree-based algorithm, boosting algorithm, etc. to see if that model is statistically different from the logistic regression scorecard. If so, you can investigate which variables differ between the models and whether the scorecard could capture them through some kind of feature engineering. However, empirical studies have shown that scorecards built on logistic regression models with good feature engineering for the variables are still hard to outperform, even with advanced machine learning models.