Model Agnostic Interpretability

Model Performance

Let’s compare all of the models we have built in this slide deck on the test dataset that we have held out until now.

As we can see in the table above, the most interpretable model - linear regression - is not the one that performs best for this dataset. It is not the worst either, but the random forest model far outperforms it, along with most of the other models. We will use this random forest in the remaining sections to try to “interpret” our machine learning model.

“Interpretability”

Classical statistical modeling tends to lend itself more easily to interpretation. It is the kind of interpretation that most people are used to with modeling. For example, in linear regression, we have the case of a straight-line relationship:

In linear regression, our interpretation is that every one-unit increase in x (the predictor variable) leads to a \(\beta\) unit increase in y (the target variable) on average, holding all other variables constant.
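
In symbols (with a single predictor for illustration, and the usual intercept \(\beta_0\)), that straight-line relationship is \(y = \beta_0 + \beta x\), where \(\beta\) is the slope being interpreted.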

Decision trees are also considered interpretable models. They have the following step function form:

Again, we can interpret ranges of the predictor variable as having a specific, estimated relationship with our target.

Most machine learning models are not interpretable in these classical ways. This is mostly because of the nonlinearity of the relationship between the predictor variables and the target variable. Machine learning models are picking up on these more complicated relationships.

However, people (especially clients) want to interpret and understand model behavior. There are important questions that drive this need for interpretability:

  • Why is someone’s loan rejected?

  • Why is this symptom occurring in this patient?

  • Why is the stock price expected to decrease?

Interpretations in machine learning models can be model and/or context specific. Model dependent interpretations are things like variable importance. Variable importance in regression might be different than in tree-based models. A context specific interpretation deals more with the effects of a change in a single variable on a target variable.

These important types of interpretations help deal with concepts like fairness / transparency, model robustness / integrity, and legal requirements. With fairness / transparency we need to understand model decisions to improve client and customer trust. These interpretations reveal model behavior on different (and potentially marginalized) groups of people. Model robustness and integrity deals with revealing odd model behavior or overfitting problems where certain variable conclusions don’t make intuitive sense. Lastly, there are a number of fields that have legal requirements around models like the Equal Credit Opportunity Act (ECOA), the Fair Credit Reporting Act (FCRA), and more.

For machine learning model interpretability, most softwares have collections of two types of interpretations - local and global. These are visualized below:

The left plot above is an example of a local interpretation, where we can say that y decreases as x increases near x = 10. The right-hand plot is more global, where we could say that y tends to increase as x increases overall.

Interpreters are calculations applied to machine learning algorithms to help us interpret the models. Three popular global interpreters are permutation importance, partial dependence, and accumulated local effects (ALE). Three popular local interpreters are individual conditional expectations (ICE), local interpretable model-agnostic explanations (LIME), and Shapley values. We will explore all of these in the sections below.

Permutation Importance

General Idea

“Let me show you how much worse the predictions of our model get if we input randomly shuffled data values for each variable.”

If a variable is important, the model should get worse when that variable is removed. To make a direct comparison, rather than removing a variable completely from the model, permutation importance removes the signal from the variable. It does this by randomly shuffling (permuting) the values of the variable, which should break the true relationship between the variable and the target variable. To make sure we didn’t just get lucky with a single permutation and accidentally make the signal on the variable look stronger, we perform multiple random permutations. For example, we can calculate how much worse the model gets on average across 5 random permutations.
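
Before jumping into the software-specific implementations, here is a minimal sketch of the idea in Python using scikit-learn’s permutation_importance, assuming a fitted model rf and a held-out test set X_test / y_test (these names are placeholders):

Code
# Minimal sketch of permutation importance with scikit-learn.
# rf, X_test, and y_test are placeholders for a fitted model and a
# held-out test set (X_test assumed to be a pandas DataFrame).
from sklearn.inspection import permutation_importance

perm = permutation_importance(
    rf, X_test, y_test,
    n_repeats=5,        # average the damage over 5 random shuffles per variable
    random_state=12345,
)

# Variables whose shuffling hurts the score the most are the most important
for i in perm.importances_mean.argsort()[::-1]:
    print(X_test.columns[i], round(perm.importances_mean[i], 4))

Each importance here is the average drop in the model’s score across the 5 shuffles of that variable.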

Let’s calculate this for each of our softwares!

Individual Conditional Expectation (ICE)

General Idea

“Let me show you how the predictions for each observation change if we vary the feature of interest.”

The individual conditional expectation (ICE) is a local interpreter that visualizes the dependence of an individual prediction on a given predictor variable. It fixes all the other variables for a single observation while changing the variable of interest and then plots the results to visualize the predictions vs the predictor variable.

Let’s look at this concept visually. First, we select both a variable of interest and a single observation. We will still look at our Garage_Area variable, but only for the first observation:

Next, we will replicate this single observation over and over into a new dataset, holding all of the other variables fixed at their original values. We will then fill in values for the variable of interest across its entire range:

Lastly, we will use the model to predict the target variable in each of these simulated observations:

This will give us an idea of how the prediction for that single observation is impacted by the variable we are interested in. We will then repeat this calculation for each of the observations in our dataset (or a large sample of them).
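
A rough sketch of this calculation for a single observation in Python, assuming a fitted model rf and a pandas DataFrame of predictors X_train that contains Garage_Area (all of these names are placeholders):

Code
# Sketch of a single ICE curve: replicate one observation across the range of
# Garage_Area and predict. rf and X_train are placeholders.
import numpy as np

obs = X_train.iloc[[0]]                                    # observation of interest
grid = np.linspace(X_train["Garage_Area"].min(),
                   X_train["Garage_Area"].max(), num=50)   # range of the variable

ice_data = obs.loc[obs.index.repeat(len(grid))].copy()     # replicate the observation
ice_data["Garage_Area"] = grid                             # vary only Garage_Area

ice_curve = rf.predict(ice_data)                           # predictions along the grid

Plotting ice_curve against grid gives the ICE curve for that observation; repeating this for every observation (or a sample of them) and overlaying the curves gives the full ICE plot.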

Let’s see this in each of our softwares!

One of the biggest disadvantages of ICE is the impact of multicollinearity. If the variable of interest is correlated with other predictor variables, some of the simulated data may be invalid. Let’s take a look at Garage_Area across values of first floor square footage:

Since we are fixing first floor square footage for a single observation and then simulating all of the possible values of Garage_Area, we could get some nonsensical values. As we can see from the horizontal dots above, a home with 2400 square feet on the first floor would reasonably have a Garage_Area between 500 and 1300 square feet. However, we are simulating these homes all the way down to no garage and up to 1500 square feet. Therefore, we must be careful when interpreting ICE predictions for a single observation, as some of the predictions extrapolate outside the reasonable range of values.

Partial Dependence

General Idea

“Let me show you what the model predicts on average when each observation has the value k for that feature, regardless of whether that value makes sense.”

A partial dependence plot (PDP) is a global interpreter that attempts to show the marginal effect of a predictor variable on the target variable. Marginal effects here are the effects of a single variable averaged over the observed values of all the other variables. Essentially, the PDP is the average of the ICE curves discussed in the section above.

Let’s look at this concept visually. First, choose a variable of interest. We will look at Garage_Area. Next, we will replicate the entire dataset while holding all variables constant except for the variable of interest:

Next, we will fill the variable of interest column in each dataset with one of the possible values (determined by the range of the data) of the predictor variable of interest:

We will use our model to generate predictions across all of the rows of each of these datasets. The predictions from each dataset will be averaged together, and each average will correspond to a particular value of the predictor variable of interest. That means we will have a dataset and an averaged prediction for each of the values in the range of Garage_Area, which is 0 to 1488:

This will give us an idea of how our target variable is impacted on average by the variable we are interested in. The resulting visual will hopefully show the potentially complex relationship between that predictor variable and the target variable.
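
A rough sketch of that averaging in Python, again assuming a fitted model rf and a pandas DataFrame of predictors X_train (placeholders):

Code
# Sketch of a partial dependence curve for Garage_Area: set every observation
# to each grid value, predict, and average. rf and X_train are placeholders.
import numpy as np

grid = np.linspace(X_train["Garage_Area"].min(),
                   X_train["Garage_Area"].max(), num=50)

pd_curve = []
for value in grid:
    X_sim = X_train.copy()
    X_sim["Garage_Area"] = value                  # every home gets this garage size
    pd_curve.append(rf.predict(X_sim).mean())     # average prediction at this value

This is exactly the average of the ICE curves from the previous section evaluated on the same grid; libraries such as scikit-learn’s PartialDependenceDisplay wrap this same calculation.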

Let’s see this in each of our softwares!

One thing you should always be careful of when plotting PDPs is the scale of the data. Most plotting functions across softwares will adjust the scale of the plot for you. This is not always helpful. Take a look at the plot below for the variable describing the month in which the house is sold.

Code
# Partial dependence plot for the month-sold variable
pd_plot$plot(c("Mo_Sold"))

This looks like there is a big change across different months of the year. However, if we focus on the y-axis, we see that these changes all occur within a couple of thousand dollars.

Accumulated Local Effects (ALE)

General Idea

“Let me show you how the model predictions change when I change the variable of interest to values within a small interval around its current values.”

Partial dependence plots are an average of ICE plots, which, as we previously discussed, have problems with multicollinearity. If the variable of interest is correlated with other predictor variables, some of the simulated data may be invalid. Let’s look again at Garage_Area across values of first floor square footage:

As before, since we are fixing first floor square footage for a single observation and then simulating all of the possible values of Garage_Area, we could get some nonsensical values. A home with 2400 square feet on the first floor would reasonably have a Garage_Area between 500 and 1300 square feet, yet we are simulating these homes all the way down to no garage and up to 1500 square feet.

Instead of looking at all possible values like ICE (and therefore PDP) does, the accumulated local effects (ALE) global interpreter uses only realistic, locally simulated data to get a clearer picture of the relationship between a variable of interest and the target variable. By default, ALE uses quantiles of your data to define these reasonable ranges of values. For the observations in each interval, we determine how much their predictions would change if we replaced the feature of interest with the upper and lower bounds of the interval, while keeping all other variables constant. Let’s look at this visually:

With ALE we do this calculation for all of the observations in each interval and then accumulate the average changes across the intervals. This allows us to understand how the predictions change within a reasonable window around the actual data values for the predictor variable.
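
A rough sketch of a first-order ALE calculation for a single numeric variable in Python, assuming a fitted model rf and a pandas DataFrame of predictors X_train (placeholders); a production implementation handles interval edges and centering more carefully:

Code
# Rough sketch of first-order ALE for Garage_Area using quantile intervals.
# rf and X_train are placeholders.
import numpy as np

feature = "Garage_Area"
edges = np.unique(np.quantile(X_train[feature], np.linspace(0, 1, 21)))  # ~20 intervals

local_effects = []
for lower, upper in zip(edges[:-1], edges[1:]):
    in_bin = X_train[(X_train[feature] > lower) & (X_train[feature] <= upper)]
    if len(in_bin) == 0:
        local_effects.append(0.0)
        continue
    X_low, X_high = in_bin.copy(), in_bin.copy()
    X_low[feature] = lower      # push each observation to the interval's lower bound
    X_high[feature] = upper     # ...and to its upper bound
    local_effects.append((rf.predict(X_high) - rf.predict(X_low)).mean())

ale = np.cumsum(local_effects)  # accumulate the local effects across intervals
ale = ale - ale.mean()          # center the curve around the average effect

Plotting ale against the interval endpoints gives the ALE plot.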

Let’s see how to do this in each of our softwares!

Local Interpretable Model-Agnostic Explanations (LIME)

General Idea

“Let me show you a linear model that could explain the exact orientation of the predictive model at a specific point.”

The best way to understand the local interpreter LIME is through a visual. Imagine that you had a nonlinear relationship between a target and a predictor variable. Now imagine you zoom in really close to a specific point of interest as in the figure below:

That zoomed-in area around the point of interest looks approximately like a straight line. We know that we can model (and interpret) straight lines with linear regression. This helps us understand the effect, or impact, of the variable of interest around our point of interest. Of course, we can expand this linear regression to include all of the variables and not just one.

Here are the basic steps of what LIME is doing:

  1. Randomly generate values (usually, normally distributed) for each variable in the model.
  2. Weight more heavily the fake observations that are near the real observation of interest.
  3. Build a weighted linear regression model based on fake observations and the observation (row in the dataset) of interest.
  4. “Interpret” the coefficients of variables and their “impact” from the linear regression model.

LIME is not actually limited to linear regression; we could use any interpretable model (a decision tree, for example). One of the biggest choices we have in LIME is the number of variables we use in the local linear regression model. Typically, we don’t use all of the variables in a local model like LIME, since we are trying to focus on the main driving factors. However, the definition of “near the point of interest” is a very big and unsolved problem in the world of LIME.
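
As a rough illustration of the four steps above (not the exact algorithm in the lime packages, which also handle interpretable feature representations and variable selection), here is a simplified sketch in Python assuming a fitted model rf and a pandas DataFrame of numeric predictors X_train (placeholders):

Code
# Simplified sketch of the four LIME steps for one observation. rf and
# X_train are placeholders for a fitted model and its (numeric) predictors.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

obs = X_train.iloc[0].values
rng = np.random.default_rng(12345)

# 1. Randomly generate (normally distributed) fake observations
fake = pd.DataFrame(
    rng.normal(loc=X_train.mean().values,
               scale=X_train.std().values,
               size=(5000, X_train.shape[1])),
    columns=X_train.columns,
)

# 2. Weight fake observations by their closeness to the observation of interest
distances = np.sqrt(((fake.values - obs) ** 2).sum(axis=1))
weights = np.exp(-(distances ** 2) / (2 * distances.std() ** 2))  # Gaussian kernel

# 3. Fit a weighted linear model to the black-box model's predictions
local_model = Ridge().fit(fake, rf.predict(fake), sample_weight=weights)

# 4. "Interpret" the local coefficients
for name, coef in zip(X_train.columns, local_model.coef_):
    print(name, round(coef, 3))

The kernel width in step 2 is exactly the “near the point of interest” choice discussed above; changing it can change the explanation.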

Let’s see how to do this in each of our softwares!

Shapley Values

General Idea

“Let me show you the value the \(j^{th}\) feature contributed to the prediction of this particular observation compared to the average prediction of the whole dataset.”

As mentioned above, the general idea of the Shapley value local interpreter is to measure the “effect” of a specific variable on the difference between the prediction for a specific point and the overall average prediction. This is best seen through an example. In Python, the predicted sale price of observation 1328 from our random forest is $672,791.28. The average predicted sale price of homes from the random forest is $180,628.03. Let’s now look at the Shapley values for observation 1328:

If we take all of the pieces you see above and sum them together, we get \(13,754.21 + 93,854.94 + \cdots + 6.14 = 492,163.25\). This is exactly the difference between our predicted value of $672,791.28 and the overall average prediction of $180,628.03. This means we can directly see how each variable impacts our specific prediction for observation 1328 in terms of the predicted sale price of the home.

This sounds great, but how does it work in the background? It goes back to the mathematical idea of game theory. Shapley (1953) assigned payout values to players depending on their contribution to the total payout of a coalition (think team). In other words, imagine you have a team of basketball players playing in a local tournament. You all win some money in the tournament. How do you split the winnings? Evenly? Maybe. Or you could split the winnings based on contribution to the team. The star player gets the most, followed by the second best player, and so on all the way down the team. That is the idea of a Shapley value in game theory.

The Shapley value in machine learning is the average marginal contribution of a variable / feature (teammate) across all possible coalitions of variables (possible combinations of teammates). For this we need to compute the average change in the prediction when the variable of interest is added to a given coalition of variables. This computation is done across all observations and across all possible combinations of variables, so it can be very time-consuming with large numbers of variables. There has been a tremendous amount of research in machine learning on how to perform this calculation much more quickly, for example through sampling or subsets of variables.
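
In Python, the shap package implements several of these faster approaches. A minimal sketch, assuming a fitted tree-based model rf and a pandas DataFrame of predictors X_train (both placeholders), might look like this:

Code
# Sketch of computing Shapley values with the shap package for a tree ensemble.
# rf and X_train are placeholders.
import shap

explainer = shap.TreeExplainer(rf)    # fast, tree-specific Shapley algorithm
shap_values = explainer(X_train)      # one set of Shapley values per observation

# The values for one observation plus the base value (the average prediction)
# recover that observation's own prediction - the same kind of sum we did by
# hand above for observation 1328.
one_obs = shap_values[0]
print(one_obs.values.sum() + one_obs.base_values)
print(rf.predict(X_train.iloc[[0]]))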

The creators of Shapley values in machine learning tout four properties of Shapley values that, they argue, make them one of the best local interpreters.

  1. Efficiency - variable contributions must sum to the difference of the prediction for the point of interest compared to the average prediction.
  2. Symmetry - contributions of two variables should be the same if they contribute equally to all possible combinations of variables.
  3. Dummy - a variable that does not change the predicted value, for any combination of variables, should have a Shapley value of 0.
  4. Additivity - for a forest of trees, the Shapley value of the forest for a given observation should be the average of the Shapley values for each tree at that given point.

Let’s see this in each of our softwares!

Cautions

There are some things to be cautious about with Shapley values. Some people try to look at the “overall impact” of a variable across all of the observations. One should be very careful with this approach as Shapley values were designed for local interpretations, not necessarily global ones.

For example, in Python we can use the shap.plots.bar function on all of the Shapley values instead of a specific one. This plot will show us the average of the absolute value of all of the Shapley values for a variable.

Code
# Average of the absolute Shapley values for each variable
shap.plots.bar(shap_values, max_display = 16)

One potentially valuable piece of information from the plot above is that the variable Gr_Liv_Area has the largest average magnitude of impact compared to the other variables. However, this number does not capture the variability of the impact. We should not assume that all of the impacts are positive (remember, this is an absolute value) or that every observation experiences this same impact. We saw in our two examples above that the same variable can have a positive effect on one observation and a negative effect on another, and the magnitude of its impact can differ across observations as well.

Another common plot you will see with Shapley values is the bee swarm plot from the shap.plots.beeswarm function applied to all of the Shapley values. This tries to show the variability of the impact of each variable across many different observations. Really, the only value from this plot is that notion of variation, but rarely have I noticed clients understanding or needing this plot. One should be hesitant to share this plot with any non-technical audience.

Code
# Distribution of Shapley values for each variable across observations
shap.plots.beeswarm(shap_values, max_display = 16)