Variable Binning and Selection

Feature Creation

Feature creation and selection are among the most important pieces of any modeling process, and credit score modeling is no different. Before selecting variables, we need to transform them. Specifically, in credit score modeling we take our continuous variables and bin them into categorical versions.

Variable Grouping

Scorecards end up containing only binned versions of the variables. There are two primary objectives when deciding how to bin the variables:

  1. Eliminate weak variables or those that do not conform to good business logic.

  2. Group the strongest variables’ attribute levels (values) in order to produce a model in the scorecard format.

Binning continuous variables helps simplify the analysis. We no longer need to explain coefficients that imply some notion of constant effect or linearity; instead, coefficients are just comparisons of categories. Binning also models non-linearity in an easily interpretable way, since we are not restricted to the linearity of continuous variables that some models assume. Outliers are easily accounted for as well, since they are typically contained within the smallest or largest bin. Lastly, missing values are no longer a problem and do not need imputation: missing values can get their own bin, making all observations available to be modeled.

There are a variety of different approaches to statistically bin variables. We will focus on the two most popular ones here:

  1. Prebinning and Combining of Bins

  2. Decision / Conditional Inference Trees

The first approach prebins the variables and then groups those bins. Imagine a variable whose range is from 3 to 63. This approach first breaks the variable into quantile bins; software packages typically use anywhere from 20 to 100 equally sized quantiles for this initial step. From there, we use chi-square tests to compare each adjacent pair of bins. If two adjacent bins are statistically the same with respect to the target variable, based on a two-by-two contingency table chi-square test (Mantel-Haenszel, for example), we combine them. We repeat this process until no more adjacent pairs of bins can be statistically combined. Below is a visual of this process.
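As a rough illustration of the merge step (a minimal sketch, not any particular package's implementation), each prebinned quantile can be reduced to its (goods, bads) counts, and adjacent bins merged while the smallest pairwise chi-square statistic falls below the critical value:

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]
    (rows = two adjacent bins, columns = goods / bads)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def merge_bins(bins, threshold=3.84):
    """Repeatedly merge the adjacent pair of bins whose chi-square
    statistic is smallest, until every adjacent pair differs
    significantly. `bins` is an ordered list of (n_good, n_bad)
    counts; 3.84 is the chi-square(1 df) critical value at p = 0.05."""
    bins = list(bins)
    while len(bins) > 1:
        stats = [chi2_2x2(g1, b1, g2, b2)
                 for (g1, b1), (g2, b2) in zip(bins, bins[1:])]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] >= threshold:
            break  # all adjacent pairs are statistically different
        g1, b1 = bins[i]
        g2, b2 = bins[i + 1]
        bins[i:i + 2] = [(g1 + g2, b1 + b2)]
    return bins
```

Here the first two bins (roughly 50/50 goods to bads each) would be merged, while a bin with 90 goods and 10 bads would stay separate.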

The second common approach uses decision / conditional inference trees. The classical CART decision tree uses the Gini statistic to find the best splits for a single variable predicting the target variable. In this scenario, the one predictor being binned is the only variable in the decision tree. Each possible split is evaluated with the Gini statistic, and the split that produces the highest measure of purity is made. This process is repeated until no further splits are possible. An example of this is shown below.
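The single-variable Gini split search can be sketched as follows (an illustrative, assumption-laden implementation, not the CART algorithm from any specific library):

```python
def gini_impurity(n_good, n_bad):
    """Gini impurity of a node with the given class counts."""
    n = n_good + n_bad
    if n == 0:
        return 0.0
    p = n_good / n
    return 2 * p * (1 - p)

def best_split(xs, ys):
    """Find the cut point on a single predictor `xs` (binary target `ys`,
    1 = good) that minimizes the weighted Gini impurity of the two
    children. Returns (cut, weighted_gini); cut is None if no split
    improves on the parent node's impurity."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_good = sum(ys)
    best = (None, gini_impurity(total_good, n - total_good))
    left_good = left_n = 0
    for i in range(n - 1):
        left_good += pairs[i][1]
        left_n += 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # cannot cut between identical values
        right_good = total_good - left_good
        right_n = n - left_n
        w = (left_n / n) * gini_impurity(left_good, left_n - left_good) \
          + (right_n / n) * gini_impurity(right_good, right_n - right_good)
        if w < best[1]:
            best = ((pairs[i][0] + pairs[i + 1][0]) / 2, w)
    return best
```

A tree-based binner would apply `best_split` recursively to each side of the chosen cut until no split improves purity (or a depth/size limit is hit), and the accumulated cut points become the bin boundaries.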

Some software packages use conditional inference trees instead of CART decision trees. These are a variation on the common CART decision tree. CART methods potentially have an inherent bias: variables with more levels are more likely to be split on when using the Gini or entropy criterion. Conditional inference trees add an extra step to guard against this. They first evaluate which variable is most significant, and only then find the best split of that specific variable, through a chi-square decision tree approach applied to that variable alone rather than to all variables. This repeats until no significant split remains. How does this apply to binning, though? When binning a continuous variable, we predict the target variable using only that one continuous variable in the conditional inference tree. The algorithm first evaluates whether the variable is significant at predicting the target. If so, it finds the most statistically significant split by running a chi-square test at each candidate cut point of the continuous variable, comparing the two groups that cut would form. After making the most significant split, you have two intervals of the variable, one below the cut and one above. The process repeats within each interval until the algorithm can no longer find significant splits, and the resulting intervals define your bins. Below is a visual of this process.
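In the single-variable case, that recursion can be sketched like so (a simplified stand-in using a plain Pearson chi-square p-value at each candidate cut; real conditional inference trees, such as `ctree` in the R partykit package, use permutation-test machinery with multiplicity adjustments):

```python
import math

def split_pvalue(a, b, c, d):
    """P-value of the 2x2 Pearson chi-square test (1 df) for the table
    [[a, b], [c, d]] (rows = below/above the cut, cols = goods/bads)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: nothing to test
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2))  # chi-square(1) survival function

def ci_bins(xs, ys, alpha=0.05, cuts=None):
    """Recursively split one predictor against a binary target (1 = good):
    test every candidate cut, keep the most significant one if its
    p-value beats `alpha`, then recurse on both halves. Returns the
    sorted cut points that define the bins."""
    if cuts is None:
        cuts = []
    pairs = sorted(zip(xs, ys))
    best_p, best_cut = 1.0, None
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        left, right = pairs[:i + 1], pairs[i + 1:]
        a = sum(y for _, y in left)
        c = sum(y for _, y in right)
        p = split_pvalue(a, len(left) - a, c, len(right) - c)
        if p < best_p:
            best_p, best_cut = p, (pairs[i][0] + pairs[i + 1][0]) / 2
    if best_cut is None or best_p >= alpha:
        return sorted(cuts)  # no significant split left
    cuts.append(best_cut)
    ci_bins([x for x, _ in pairs if x < best_cut],
            [y for x, y in pairs if x < best_cut], alpha, cuts)
    ci_bins([x for x, _ in pairs if x >= best_cut],
            [y for x, y in pairs if x >= best_cut], alpha, cuts)
    return sorted(cuts)
```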

Cut-offs (or cut points) from the decision tree algorithms can be rather rough. Sometimes we override the automatically generated cut points so they conform more closely to business rules. These overrides might make the bins sub-optimal, but hopefully not so much that they impact the analysis.

Imagine a similar scenario in linear regression. Suppose you had two models, the first with \(R^2 = 0.8\) and the second with \(R^2 = 0.78\), but the second made more intuitive business sense than the first. You would probably choose the second model, willing to sacrifice a small amount of predictive power for a model that makes more intuitive sense. The same thinking applies when slightly altering the bins produced by the two approaches described above.

Let’s see how each of our software packages approaches binning continuous variables!

Weight of Evidence (WOE)

Weight of evidence (WOE) measures the strength of the attributes (bins) of a variable in separating events and non-events in a binary target variable. In credit scoring, that implies separating bad and good accounts respectively.

Weight of evidence is based on comparing the proportion of goods to bads at each bin level and is calculated as follows for each bin within a variable:

\[ WOE_i = \log\left(\frac{Dist.\,Good_i}{Dist.\,Bad_i}\right) \]

The distribution of goods for each bin is the number of goods in that bin divided by the total number of goods across all bins. The distribution of bads for each bin is the number of bads in that bin divided by the total number of bads across all bins. An example is shown below:
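As a numeric sketch of that calculation (illustrative only), each bin's WOE can be computed from its (goods, bads) counts:

```python
import math

def woe(bins):
    """Weight of evidence for each bin, given (n_good, n_bad) counts.
    WOE_i = log(dist_good_i / dist_bad_i), where each distribution is
    the bin's share of all goods (or all bads) across bins."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    return [math.log((g / total_good) / (b / total_bad)) for g, b in bins]
```

For example, a bin holding 80% of the goods but only 20% of the bads gets a WOE of \(\log(4) \approx 1.386\), and a mirror-image bin gets \(-1.386\).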

WOE summarizes the separation between events and non-events (bads and goods) as shown in the following table:

When evaluating WOE, we are looking for big differences in the WOE values between bins.

Ideally, we would like to see monotonic WOE for variables that have ordered bins. This isn’t always required as long as the WOE pattern across the bins makes business sense. However, if a variable’s bins bounce back and forth between positive and negative WOE values, then the variable typically has trouble separating goods and bads. Graphically, the WOE values for all the bins in the bureau_score variable look as follows in the line plot below:

The histogram in the plot above also displays the distribution of events and non-events as the WOE values change. A WOE of approximately zero implies the distribution of non-events (goods) is approximately equal to the distribution of events (bads), so that bin doesn’t do a good job of separating them. Positive WOE values imply the bin identifies observations that are non-events (goods), while negative WOE values imply the bin identifies observations that are events (bads).

One quick side note: WOE values can become positive or negative infinity when quasi-complete separation exists in a category (zero events or non-events). Some people adjust the WOE calculation with a small smoothing parameter so that neither the numerator nor the denominator equals zero.
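One such adjustment (a sketch of the general idea; the exact form of the smoothing varies by implementation) adds a small constant to each bin's counts before forming the distributions:

```python
import math

def smoothed_woe(bins, adj=0.5):
    """WOE with a small additive adjustment `adj` to each bin's counts,
    so bins with zero goods or zero bads still get a finite WOE."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    return [math.log(((g + adj) / (total_good + 2 * adj)) /
                     ((b + adj) / (total_bad + 2 * adj)))
            for g, b in bins]
```

With `adj=0.5`, a bin containing 10 goods and 0 bads gets a large positive, but finite, WOE instead of infinity.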

Let’s see how to get the weight of evidence values in each of our software packages!

Information Value (IV) & Variable Selection

Weight of evidence summarizes the individual categories or bins of a variable. However, we need a measure of how well all the categories in a variable do at separating the events from non-events. That is what information value (IV) is for. IV uses the WOE from each category as a piece of its calculation:

\[ IV = \sum_{i=1}^L (Dist.\,Good_i - Dist.\,Bad_i)\times \log\left(\frac{Dist.\,Good_i}{Dist.\,Bad_i}\right) \]

In credit modeling, IV is sometimes used to actually select which variables belong in the model. Here are some typical IV ranges for determining the strength of a predictor variable:

  • \(0 \le IV < 0.02\) - Not a predictor
  • \(0.02 \le IV < 0.1\) - Weak predictor
  • \(0.1 \le IV < 0.25\) - Moderate (medium) predictor
  • \(0.25 \le IV\) - Strong predictor
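Putting the formula and the conventional thresholds above together (an illustrative sketch; the cutoffs are rules of thumb, not hard rules):

```python
import math

def information_value(bins):
    """IV = sum over bins of (dist_good - dist_bad) * WOE, from a list
    of (n_good, n_bad) counts. Each term is non-negative, so stronger
    separation in any bin only increases the total."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    iv = 0.0
    for g, b in bins:
        dist_good, dist_bad = g / total_good, b / total_bad
        iv += (dist_good - dist_bad) * math.log(dist_good / dist_bad)
    return iv

def iv_strength(iv):
    """Map an IV to the conventional strength label."""
    if iv < 0.02:
        return "not a predictor"
    if iv < 0.1:
        return "weak"
    if iv < 0.25:
        return "moderate"
    return "strong"
```

A variable whose two bins split 80/20 and 20/80 between goods and bads, for instance, lands well into the "strong" range.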

Variables with information values greater than 0.1 are typically used in credit modeling.

Some resources will say that IV values greater than 0.5 might signal over-predicting of the target. In other words, maybe the variable is too strong of a predictor because of how that variable was originally constructed. For example, if all previous loan decisions have been made only on bureau score, then of course that variable would be highly predictive and possibly the only significant variable. In these situations, good practice is to build one model with only bureau score and another model without bureau score but with other important factors. We then ensemble these models together.

Let’s see how to get the information values for our variables in each of our software packages!