Comparing Models

One of the common concerns and questions in any model development is determining how “good” a model is or how well it performs. Many different factors determine this, and most of them depend on the goals for the model. There are typically two different purposes for modeling - estimation and prediction. Estimation quantifies the expected change in our target variable associated with some relationship or change in the predictor variables. Prediction, on the other hand, is focused on predicting new target observations. However, these goals are rarely seen in isolation, as most people desire a blend of the two in their models. This section will cover many of the popular metrics for model assessment.

The first thing to remember about model assessment is that a model is only “good” in context with another model. All of these model metrics are truly model comparisons. Is an accuracy of 80% good? It depends! If the previous model has an accuracy of 90%, then no, the new model is not good. However, if the previous model has an accuracy of 70%, then yes, the new model is good. Although we will show many of the calculations, at no place will we say that you must meet a certain threshold for your models to be considered “good” because these metrics are designed for comparison.

Some common model metrics are based on deviance/likelihood calculations. Three common logistic regression metrics based on these are the following:

  1. AIC
  2. BIC
  3. Generalized (Nagelkerke) \(R^2\)

Without going into too much mathematical detail, the AIC is a crude, large-sample approximation of leave-one-out cross-validation. The BIC, on the other hand, favors a smaller model than the AIC because it penalizes model complexity more. For both the AIC and BIC, lower values are better. However, there is no set amount by which one value must be lower than another to declare a model meaningfully better. Neither the AIC nor the BIC is necessarily better than the other; however, they may not always agree on the “best” model.
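For reference, both criteria start from \(-2\log L\), where \(L\) is the maximized likelihood, and add a penalty based on the number of estimated parameters \(k\) (and, for the BIC, the sample size \(n\)):

\[ AIC = -2\log L + 2k \]

\[ BIC = -2\log L + k\log(n) \]

Since \(\log(n)\) exceeds 2 once \(n > 7\), the BIC charges more for each additional parameter, which is why it tends to favor smaller models.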

There are a number of “pseudo”-\(R^2\) metrics for logistic regression. Here, higher values indicate a better model. The Generalized (Nagelkerke) \(R^2\) measures how much better a model is (in terms of likelihood) compared to the intercept-only model. Therefore, when we compare two models with this metric, both are measured against the same baseline, and whichever beats that baseline by more is the model we want. Even though it is bounded between 0 and 1, it does not carry the proportion-of-variation interpretation that \(R^2\) has in linear regression.

We will be using the Ames, Iowa housing data set for this section. Let’s see how we get these metrics in each of our softwares!
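Although the software code itself is not reproduced here, the sketch below shows one way to pull these metrics from a fitted logistic regression in Python with statsmodels. It uses simulated data with made-up column names (Gr_Liv_Area, Central_Air, Bonus) standing in for the Ames variables, and the Nagelkerke \(R^2\) is computed by hand from the model and intercept-only log-likelihoods, so treat it as an illustrative sketch rather than the analysis from this section.

```python
# A minimal sketch with simulated data standing in for the Ames housing data;
# the column names and formula below are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "Gr_Liv_Area": rng.normal(1500, 400, n),
    "Central_Air": rng.integers(0, 2, n),
})
logit_p = -8 + 0.004 * df["Gr_Liv_Area"] + 0.8 * df["Central_Air"]
df["Bonus"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("Bonus ~ Gr_Liv_Area + Central_Air", data=df).fit(disp=0)

# AIC and BIC: lower is better, but only as a comparison between models
print("AIC:", fit.aic)
print("BIC:", fit.bic)

# Generalized (Nagelkerke) R^2: Cox-Snell R^2 rescaled to a 0-1 range,
# built from the fitted and intercept-only log-likelihoods
r2_cs = 1 - np.exp(2 * (fit.llnull - fit.llf) / n)
r2_nagelkerke = r2_cs / (1 - np.exp(2 * fit.llnull / n))
print("Nagelkerke R^2:", r2_nagelkerke)
```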

Probability Metrics

Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.

In this section we will focus on the probability metrics. Since we are predicting the probability of an event, we want our model to assign higher predicted probabilities to events and lower predicted probabilities to non-events.

Coefficient of Discrimination

The coefficient of discrimination (or discrimination slope) is the difference in average predicted probabilities of the actual events and non-events.

\[ D = \bar{\hat{p}}_1 - \bar{\hat{p}}_0 \]

The bigger this difference, the better our model does at separating the events and non-events through the predicted probabilities. We can also compare the histograms of these predictions for the two groups.

Let’s see how to calculate this in each of our softwares!
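As a stand-in for the software output, here is a minimal sketch of the calculation on a small set of hypothetical outcomes and predicted probabilities (all of the numbers below are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

# discrimination slope: mean predicted probability for events minus non-events
D = p_hat[y == 1].mean() - p_hat[y == 0].mean()
print("Coefficient of discrimination:", round(D, 3))   # about 0.26 for these values

# histograms of the predicted probabilities by actual outcome
plt.hist(p_hat[y == 1], alpha=0.5, label="Events")
plt.hist(p_hat[y == 0], alpha=0.5, label="Non-events")
plt.xlabel("Predicted probability")
plt.legend()
plt.show()
```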

Rank-Order Statistics

Rank-order statistics measure how well a model orders the predicted probabilities. Three common metrics that summarize this ordering are concordance, discordance, and ties. For these metrics, every possible combination of an event and a non-event is compared (1 event vs. 1 non-event). A concordant pair is a pair where the event has a higher predicted probability than the non-event - the model got the rank correct. A discordant pair is a pair where the event has a lower predicted probability than the non-event - the model got the rank wrong. A tied pair is a pair where the event and non-event have the same predicted probability - the model isn’t sure how to distinguish between them. Models with higher concordance are considered better. The interpretation of concordance is that, across all possible event and non-event combinations, the model assigned the higher predicted probability to the observation with the event concordance% of the time.

There are a host of other metrics based on these rank-order statistics, such as the \(c\)-statistic, Somers’ D, and Kendall’s \(\tau_a\). The calculations for these are as follows:

\[ c = \text{Concordance} + \frac{1}{2}\times \text{Tied} \]

\[ D_{xy} = 2c - 1 \]

\[ \tau_a = \frac{\text{Concordant} - \text{Discordant}}{0.5 \times n \times (n-1)} \]

With all of these, a higher proportion of concordant pairs results in higher values of these metrics.

Let’s see how to calculate these in each of our softwares!
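The brute-force sketch below pairs every event with every non-event, counts the concordant, discordant, and tied pairs, and then forms the \(c\)-statistic, Somers’ D, and Kendall’s \(\tau_a\); the outcomes and probabilities are hypothetical.

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

events, nonevents = p_hat[y == 1], p_hat[y == 0]
pairs = [(e, ne) for e in events for ne in nonevents]   # every event vs. non-event pair

n_conc = sum(e > ne for e, ne in pairs)    # event ranked above non-event
n_disc = sum(e < ne for e, ne in pairs)    # event ranked below non-event
n_tied = sum(e == ne for e, ne in pairs)   # same predicted probability

concordance = n_conc / len(pairs)
tied = n_tied / len(pairs)

n = len(y)
c = concordance + 0.5 * tied                         # c-statistic
somers_d = 2 * c - 1                                 # Somers' D
tau_a = (n_conc - n_disc) / (0.5 * n * (n - 1))      # Kendall's tau-a
print(round(c, 3), round(somers_d, 3), round(tau_a, 3))
```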

Classification Metrics

Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.

In this section we will focus on the classification metrics. We want a model to correctly classify events and non-events. Classification forces the model to predict either an event or a non-event for each observation based on that observation’s predicted probability - for example, \(\hat{y}_i = 1\) if \(\hat{p}_i > 0.5\). These probability boundaries are called cut-offs or thresholds. However, strict classification-based measures completely discard any information about the actual quality of the model’s predicted probabilities.

Many of the metrics around classification try to balance different pieces of the classification table (also called the confusion matrix). An example of one is shown below.

Classification Table Example
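A table like this can be built directly from the actual outcomes and the classifications implied by a chosen cut-off; here is a minimal sketch using hypothetical values and a 0.5 cut-off.

```python
import numpy as np
import pandas as pd

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

y_pred = (p_hat > 0.5).astype(int)   # classify each observation at a 0.5 cut-off
table = pd.crosstab(y, y_pred, rownames=["Actual"], colnames=["Predicted"])
print(table)
```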

Let’s examine the different pieces of the classification table that people jointly focus on.

Sensitivity & Specificity

Sensitivity is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, it is the proportion of the time you correctly predicted an event. This is also called the true positive rate, and it is just another name for recall.

Example Calculation of Sensitivity

This is typically balanced with the specificity. Specificity is the proportion of times you were able to predict a non-event in the group of actual non-events. Of the actual non-events, it is the proportion of the time you correctly predicted a non-event. This is also called the true negative rate.

Example Calculation of Specificity

These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum sensitivity you could just predict that every observation is an event; however, this would drop your specificity to 0. The reverse is also true. Those who focus on sensitivity and specificity want balance between the two. One measure for the optimal cut-off from a model is Youden’s Index (or Youden’s J statistic). This is easily calculated as \(J = sensitivity + specificity - 1\). The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.

Let’s see how to do this in each of our softwares!
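A minimal sketch, again on hypothetical outcomes and predicted probabilities, that computes sensitivity and specificity at a 0.5 cut-off and then searches a grid of cut-offs for the one that maximizes Youden’s J:

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

def sens_spec(y, p_hat, cutoff):
    pred = (p_hat > cutoff).astype(int)
    sens = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()   # true positive rate
    spec = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()   # true negative rate
    return sens, spec

print("Sensitivity and specificity at a 0.5 cut-off:", sens_spec(y, p_hat, 0.5))

# Youden's J = sensitivity + specificity - 1, evaluated on a grid of cut-offs
cutoffs = np.linspace(0.01, 0.99, 99)
j = [sum(sens_spec(y, p_hat, c)) - 1 for c in cutoffs]
print("Cut-off maximizing Youden's J:", round(cutoffs[int(np.argmax(j))], 2))
```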

K-S Statistic

One of the most popular metrics for classification models in the finance and banking industry is the KS statistic. The two-sample KS statistic can determine if there is a difference between two cumulative distribution functions. The two cumulative distribution functions of interest to us are those of the predicted probabilities for the event and non-event target groups. The KS \(D\) statistic is the maximum distance between these two curves - calculated as the maximum difference between the true positive and false positive rates, \(D = \max_{\text{depth}}{(TPR - FPR)} = \max_{\text{depth}}{(Sensitivity + Specificity - 1)}\). Notice that this is the same as maximizing Youden’s Index.

The optimal cut-off for determining predicted events and non-events would be at the point where this \(D\) statistic (Youden Index) is maximized.

Let’s see how to do this in each of our softwares!
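A minimal sketch of the \(D\) statistic as the largest gap between the true positive rate and false positive rate across the observed predicted probabilities (hypothetical values again):

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

cutoffs = np.sort(np.unique(p_hat))
tpr = np.array([((p_hat >= c) & (y == 1)).sum() / (y == 1).sum() for c in cutoffs])
fpr = np.array([((p_hat >= c) & (y == 0)).sum() / (y == 0).sum() for c in cutoffs])
d = tpr - fpr                                    # separation at each cut-off

print("KS D statistic:", round(d.max(), 3))
print("Cut-off at maximum separation:", cutoffs[int(d.argmax())])
```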

Precision & Recall

Precision and recall are another way to view a classification table from a model. Recall is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, it is the proportion of the time you correctly predicted an event. This is also called the true positive rate, and it is just another name for sensitivity.

Example Recall Calculation

This is balanced here with the precision. Precision is the proportion of predicted events that were actually events. Of the predicted events, it is the proportion of the time they actually were events. This is also called the positive predictive value. Precision is growing in popularity as a balance to recall/sensitivity, as compared to specificity.

Example Precision Calculation

These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum recall you could just predict that every observation is an event; however, this would drop your precision. Those who focus on precision and recall want balance between the two. One measure for the optimal cut-off from a model is the F1 score. This is calculated as the following:

\[ F_1 = 2\times \frac{precision \times recall}{precision + recall} \]

The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.

Let’s see how to do this in each of our softwares!
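A minimal sketch, again on hypothetical values, that computes the F1 score at a 0.5 cut-off and searches a grid for the cut-off that maximizes it:

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

def f1_score(y, p_hat, cutoff):
    pred = (p_hat > cutoff).astype(int)
    tp = ((pred == 1) & (y == 1)).sum()
    precision = tp / pred.sum() if pred.sum() > 0 else 0.0   # of predicted events, % actual
    recall = tp / (y == 1).sum()                             # of actual events, % predicted
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print("F1 at a 0.5 cut-off:", round(f1_score(y, p_hat, 0.5), 3))

cutoffs = np.linspace(0.01, 0.99, 99)
f1 = [f1_score(y, p_hat, c) for c in cutoffs]
print("Cut-off maximizing F1:", round(cutoffs[int(np.argmax(f1))], 2))
```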

Accuracy & Error

Accuracy and error rate are the metrics people typically think of when it comes to measuring the ability of a logistic regression model. Accuracy is essentially the percentage of events and non-events that were predicted correctly.

Example Accuracy Calculation

The error rate is the complement of this - the percentage of observations that were predicted incorrectly.

Example Error Calculation

However, caution should be used with these metrics, as they can be misleading if they are the only things you focus on. If your data has 10% events and 90% non-events, you can easily have a 90% accurate model by guessing a non-event for every observation. Instead, a model with lower accuracy might be better if it can predict both events and non-events. These numbers are still great to report! They are just not the best for deciding which model is best.

Let’s see how to do this in each of our softwares!
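A minimal sketch on hypothetical values, classifying at a 0.5 cut-off:

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

y_pred = (p_hat > 0.5).astype(int)
accuracy = (y_pred == y).mean()      # proportion classified correctly
error = 1 - accuracy                 # proportion classified incorrectly
print("Accuracy:", accuracy, "Error rate:", error)
```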

Optimal Cut-off

Classification is a decision that is extraneous to statistical modeling. Although logistic regression tends to work well for classification, it is a probability model and does not output events and non-events.

Classification assumes the cost of misclassifying each observation is the same, which is rarely true. It is always better to consider the costs of false positives and false negatives when choosing cut-offs for classification. The previous sections describe many ways to determine “optimal” cut-offs when costs are either not known or not needed. However, if costs are known, they should drive the cut-off decision rather than modeling metrics that do not account for cost.
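For example, if a false negative were believed to cost five times as much as a false positive (both numbers made up purely for illustration), the cut-off could be chosen to minimize total cost rather than a statistical criterion:

```python
import numpy as np

# hypothetical actual outcomes (1 = event) and predicted probabilities
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 0, 0])
p_hat = np.array([0.81, 0.46, 0.55, 0.62, 0.12, 0.07, 0.33, 0.28, 0.26, 0.18])

cost_fp, cost_fn = 1, 5          # assumed business costs of each type of error
cutoffs = np.linspace(0.01, 0.99, 99)
costs = []
for c in cutoffs:
    pred = (p_hat > c).astype(int)
    fp = ((pred == 1) & (y == 0)).sum()   # false positives at this cut-off
    fn = ((pred == 0) & (y == 1)).sum()   # false negatives at this cut-off
    costs.append(cost_fp * fp + cost_fn * fn)

print("Cost-minimizing cut-off:", round(cutoffs[int(np.argmin(costs))], 2))
```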