Logistic Regression Example

Now that we have transformed our variables for modeling, we can begin building our model. In credit modeling, we first build an initial credit scoring model. For each model that we build, we take the following three steps:

  1. Build the model
  2. Evaluate the model
  3. Convert the model to a scorecard

The scorecard is typically based on a logistic regression model:

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]

In the above equation, \(p\) is the probability of default given the inputs in the model. However, instead of using the original variables, credit scoring models and their accompanying scorecards use the binned variables as their foundation. There are two different approaches to handling this:

  1. WOE approach - use the WOE scores as new continuous variables
  2. Binned approach - use the bins themselves as new categorical variables

These two approaches are best seen by looking at some observations from our dataset, as illustrated below:
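As a small illustration (the variable, bin labels, and WOE values here are hypothetical), the same binned variable can enter the model either as a single WOE column or as a set of indicator columns:

```python
import pandas as pd

# Hypothetical bins and WOE values for a single variable, bureau_score
obs = pd.DataFrame(
    {"bureau_score_bin": ["(-inf, 580)", "[580, 660)", "[660, inf)"]}
)
woe_map = {"(-inf, 580)": -0.85, "[580, 660)": 0.10, "[660, inf)": 0.95}

# WOE approach: one continuous column per variable
obs["bureau_score_woe"] = obs["bureau_score_bin"].map(woe_map)

# Binned approach: one indicator column per bin
dummies = pd.get_dummies(obs["bureau_score_bin"], prefix="bureau_score")

print(pd.concat([obs, dummies], axis=1))
```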

There are advantages and disadvantages to both techniques.

For the WOE approach, all of the variables are treated as continuous, which reduces the size of the model. These variables are now all on the same WOE scale, which makes them comparable in terms of variable importance (here, the logistic regression coefficients). This is the default approach in Python’s OptBinning package, sketched below.
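Here is a minimal sketch of the WOE approach with OptBinning’s `BinningProcess`; the data is a hypothetical stand-in for the credit dataset:

```python
import numpy as np
import pandas as pd
from optbinning import BinningProcess
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in for the accepted-applicant training data
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "bureau_score": rng.normal(650, 60, 1000),
    "debt_to_income": rng.uniform(0, 0.6, 1000),
})
y = (rng.random(1000) < 0.1).astype(int)  # 1 = default (bad)

# Bin every variable optimally, then replace each raw value with
# the WOE of the bin it falls into
binning_process = BinningProcess(list(train.columns))
binning_process.fit(train, y)
X_woe = binning_process.transform(train, metric="woe")

# One coefficient per variable; every input shares the WOE scale
woe_model = LogisticRegression()
woe_model.fit(X_woe, y)
```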

For the binned approach, the variables are all now categorical in nature. That means we have many more model terms, since the bins must be encoded (with one-hot encoding, for example) to get them into the model. However, this is the approach programmed into R, which makes the implementation much easier in that software.
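Continuing the sketch above, the binned approach recovers each observation’s bin label with `transform(metric="bins")` and one-hot encodes the labels:

```python
# Binned approach, continuing the sketch above: recover each
# observation's bin label, then one-hot encode the labels
X_bins = pd.DataFrame(
    binning_process.transform(train, metric="bins"), columns=train.columns
)
X_onehot = pd.get_dummies(X_bins, drop_first=True)  # avoid collinearity

bin_model = LogisticRegression()
bin_model.fit(X_onehot, y)
```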

Now that we have built our logistic credit scoring model, we can evaluate it before adding on the scorecard points.

Model Evaluation

Credit models are evaluated like most classification models. Overall model performance is typically evaluated with the area under the ROC curve (AUC) as well as the Kolmogorov-Smirnov (K-S) statistic.

Let’s see how to perform this in our software!
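Here is a minimal sketch of both metrics with scikit-learn, reusing the hypothetical WOE model from above (evaluation is in-sample purely for brevity; in practice use a holdout sample):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Predicted probability of default
p_hat = woe_model.predict_proba(X_woe)[:, 1]

# Area under the ROC curve
auc = roc_auc_score(y, p_hat)

# K-S statistic: maximum gap between the cumulative score
# distributions of the goods and the bads, i.e. max(TPR - FPR)
fpr, tpr, _ = roc_curve(y, p_hat)
ks = (tpr - fpr).max()

print(f"AUC = {auc:.3f}   K-S = {ks:.3f}")
```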

In case you were curious how the traditional approach of developing variables with WOE values would have performed compared to the “modern” approach of using the binned variables, here is a comparison:

As you can see, they are very similar in their performance.

Scaling the Scorecard

The last step of the credit modeling process is building the scorecard itself. To create the scorecard we need to relate the predicted odds from our logistic regression model to scorecard points. The relationship between the odds and scores is represented by a linear function:

\[ Score = Offset + Factor \times \log(odds) \]

All that we need to define is the number of points it takes to double the odds (called the PDO) and the scorecard points corresponding to a chosen odds level. From there we have the following extra equation:

\[ Score + PDO = Offset + Factor \times \log(2 \times odds) \]

Subtracting the first equation from the second shows that \(PDO = Factor \times \log(2)\). Through some basic algebra, the solution for \(Factor\) and \(Offset\) is shown to be:

\[ Factor = \frac{PDO}{\log(2)} \]

\[ Offset = Score - Factor \times \log(odds) \]

For example, suppose a scorecard were scaled so that odds of 50:1 correspond to 600 points with PDO = 20. Through the above equations we calculate \(Factor = 20/\log(2) = 28.85\) and \(Offset = 600 - 28.85 \times \log(50) = 487.12\). Therefore, the corresponding score for each predicted odds from the logistic regression model is calculated as:

\[ Score = 487.12 + 28.85\times \log(odds) \]

For this example, we would then calculate the score for each individual in our dataset. Notice how the above equation uses \(\log(odds)\), which comes directly from the prediction of a logistic regression model: \(\log(odds) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k\). This is one of the reasons logistic regression is still very popular in the credit modeling world.
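A short sketch of this scaling arithmetic, reusing the hypothetical predictions `p_hat` from the evaluation sketch above (treating the scorecard odds as good:bad odds is an assumption of this sketch):

```python
import numpy as np

pdo = 20          # points to double the odds
base_odds = 50    # target odds of 50:1
base_score = 600  # score at those odds

factor = pdo / np.log(2)                          # 28.85
offset = base_score - factor * np.log(base_odds)  # 487.12

# Assumed convention: the scorecard odds are good:bad odds, so with
# p_hat = P(default) the log-odds are log((1 - p) / p), i.e. the
# negative of the model's linear predictor
log_odds = np.log((1 - p_hat) / p_hat)
scores = offset + factor * log_odds
```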

The next step in building the scorecard is to allocate the scorecard points to each of the categories (bins) of each variable. The points allocated to the \(i^{th}\) bin of variable \(j\) are computed as follows:

\[ Points_{i,j} = -(WOE_{i,j} \times \hat{\beta}_j + \frac{\hat{\beta}_0}{L}) \times Factor + \frac{Offset}{L} \]

The \(WOE_{i,j}\) is the weight of evidence of the \(i^{th}\) bin of variable \(j\). The coefficient of the variable \(j\), \(\hat{\beta}_j\), as well as the intercept \(\hat{\beta}_0\), come from the logistic regression model. \(L\) is the number of variables in the model.
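A sketch of this point allocation by hand, reusing the hypothetical WOE model, binning process, and \(Factor\)/\(Offset\) values from the sketches above (OptBinning’s `get_binned_variable` and `binning_table` recover each bin’s WOE):

```python
# Allocate points to every bin of every variable using the formula
# above, with the hypothetical WOE model from earlier
L = X_woe.shape[1]            # number of variables in the model
beta = woe_model.coef_[0]     # coefficients of the WOE variables
beta0 = woe_model.intercept_[0]

points = {}
for j, var in enumerate(train.columns):
    optb = binning_process.get_binned_variable(var)
    table = optb.binning_table.build(add_totals=False)
    for _, row in table.iterrows():
        points[(var, str(row["Bin"]))] = (
            -(row["WoE"] * beta[j] + beta0 / L) * factor + offset / L
        )
```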

Let’s see this in our software!
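OptBinning’s `Scorecard` class bundles the binning, regression, scaling, and point allocation into one object. A minimal sketch using the scaling from our example:

```python
from optbinning import Scorecard

# scaling_method="pdo_odds" matches the Factor/Offset equations above
scorecard = Scorecard(
    binning_process=BinningProcess(list(train.columns)),
    estimator=LogisticRegression(),
    scaling_method="pdo_odds",
    scaling_method_params={"pdo": 20, "odds": 50, "scorecard_points": 600},
)
scorecard.fit(train, y)

print(scorecard.table(style="summary"))  # points per bin per variable
scores = scorecard.score(train)          # total score per applicant
```

Plotting the distribution of `scores` separately for the good and bad customers produces the histogram discussed below.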

Looking at the histogram below we see that our good customers tend to have higher scores than our bad customers. The vertical dashed lines represent the average score for each group. This is exactly what we are looking for!

This initial credit scoring model based only on the applicants who got loans (accepted applicants) is called a behavioral scorecard because it models the behavior of your current loans.