
Explain Any Models with the SHAP Values — Use the KernelExplainer

source link: https://towardsdatascience.com/explain-any-models-with-the-shap-values-use-the-kernelexplainer-79de9464897a?gi=c9d9e7f6d8b4


Nov 7 · 13 min read

Use the KernelExplainer for the SHAP Values


Since I published the article “ Explain Your Model with the SHAP Values ”, which was built on a random forest tree, readers have been asking if there is one single SHAP approach for any ML algorithm — either tree-based or non-tree-based. That’s exactly what the KernelExplainer , a model-agnostic method, is designed to do. In this post, I will demonstrate how to use the KernelExplainer for models built in KNN, SVM, Random Forest, GBM, or the H2O module. If you want more background on the SHAP values, I strongly recommend “ Explain Your Model with the SHAP Values ”, in which I describe carefully how the SHAP values emerge from the Shapley value, what the Shapley value is in Game Theory, and how the SHAP values work in Python.

Use the SHAP Values to Interpret Your Sophisticated Model

Consider this question: “Is your sophisticated machine learning model easy to understand?” An understandable model explains its predictions in terms of input variables that make business sense, and those variables behave in ways that fit the expectations users have formed from prior knowledge.

Lundberg et al. in their brilliant paper “ A unified approach to interpreting model predictions ” proposed the SHAP (SHapley Additive exPlanations) values which offer a high level of interpretability for a model. The SHAP values provide two great advantages:

  1. Global interpretability — the SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable. This is like the variable importance plot but it is able to show the positive or negative relationship for each variable with the target (see the summary plots below).
  2. Local interpretability — each observation gets its own set of SHAP values (see the individual force plots below). This greatly increases its transparency. We can explain why a case receives its prediction and the contributions of the predictors. Traditional variable importance algorithms only show the results across the entire population but not on each individual case. The local interpretability enables us to pinpoint and contrast the impacts of the factors.

The SHAP values can be produced by the Python module SHAP .
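If you have not installed the module yet, a minimal setup looks like this (the pip command is shown as a comment and assumes a standard Python environment):

# Install the SHAP module once in your environment:
# pip install shap
import shap
# Enable the JavaScript rendering that force_plot() uses in notebooks
shap.initjs()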

What Does the KernelExplainer Do?

The KernelExplainer builds a weighted linear regression by using your data, your predictions, and whatever function produces the predicted values. It computes the variable importance values based on the Shapley values from game theory and the coefficients from a local linear regression.

The drawback of the KernelExplainer is its long running time. If your model is a tree-based machine learning model, you should use the tree explainer TreeExplainer() , which has been optimized to render fast results. If your model is a deep learning model, use the deep learning explainer DeepExplainer() . The SHAP Python module does not yet have specifically optimized algorithms for all model types (such as KNN).
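As a rough guide, the choice of explainer can be sketched as follows; the model objects in the comments (a fitted tree model, a fitted deep learning model, or any other fitted model) are placeholders, not objects from this article:

import shap
# Tree-based models (random forest, GBM, XGBoost, ...): optimized and fast
# explainer = shap.TreeExplainer(tree_model)
# Deep learning models (TensorFlow/Keras, PyTorch): optimized for neural networks
# explainer = shap.DeepExplainer(deep_model, background_data)
# Any other model (KNN, SVM, ...): model-agnostic but slow
# explainer = shap.KernelExplainer(any_model.predict, background_data)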

In “ Explain Your Model with the SHAP Values ” I use the function TreeExplainer() for a random forest model. To let you compare the results, I will use the same data source but use the function KernelExplainer() .

I will repeat the following four plots for all of the algorithms:

summary_plot()
dependence_plot()
force_plot() for an individual observation
force_plot() for the entire dataset (the collective force plot)

Random Forest

First, let’s load the same data that was used in “ Explain Your Model with the SHAP Values ”.

import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
df = pd.read_csv('/winequality-red.csv') # Load the data
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
# The target variable is 'quality'.
Y = df['quality']
X =  df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar','chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', 'alcohol']]
# Split the data into train and test data:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

Let’s build a random forest model and print out the variable importance.

rf = RandomForestRegressor(max_depth=6, random_state=0, n_estimators=10)
rf.fit(X_train, Y_train)  
print(rf.feature_importances_)
importances = rf.feature_importances_
indices = np.argsort(importances)
features = X_train.columns
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()


Figure (A) Random Forest Variable Importance Plot

The function KernelExplainer() below builds the explainer from the prediction method rf.predict and the background data on which you want to compute the SHAP values. Here I use the test dataset X_test , which has 160 observations. This step can take a while.

import shap
rf_explainer = shap.KernelExplainer(rf.predict, X_test)
rf_shap_values = rf_explainer.shap_values(X_test)
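
Note that the run time of the KernelExplainer grows quickly with the size of the background data, because the model is evaluated many times. A common way to speed it up, supported by the SHAP module, is to summarize the background data with shap.kmeans before building the explainer. The sketch below is optional and uses 30 clusters as an arbitrary choice:

# Summarize the background data into 30 weighted k-means centers (arbitrary choice)
X_test_summary = shap.kmeans(X_test, 30)
rf_explainer_fast = shap.KernelExplainer(rf.predict, X_test_summary)
rf_shap_values_fast = rf_explainer_fast.shap_values(X_test)
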
  1. The summary plot

This plot is loaded with information. The biggest difference between this plot and the regular variable importance plot (Figure A) is that it shows the positive and negative relationships of the predictors with the target variable. It looks dotty because it is made of all the dots in the test data. Let me walk you through it:

shap.summary_plot(rf_shap_values, X_test)


  • Feature importance: Variables are ranked in descending order.
  • Impact: The horizontal location shows whether the effect of that value is associated with a higher or lower prediction.
  • Original value: Color shows whether that variable is high (in red) or low (in blue) for that observation.
  • Correlation: A high level of the “alcohol” content has a high and positive impact on the quality rating. The “high” comes from the red color, and the “positive” impact is shown on the X-axis. Similarly, we will say “volatile acidity” is negatively correlated with the target variable.

2. The dependence plot

The partial dependence plot , or dependence plot for short, is an important plot for machine learning outcomes ( J. H. Friedman 2001 ). It shows the marginal effect that one or two variables have on the predicted outcome, and it tells whether the relationship between the target and a variable is linear, monotonic, or more complex. Suppose we want the dependence plot of “alcohol”. The Python module SHAP automatically includes the other variable that “alcohol” interacts with most. The following plot shows that there is an approximately linear and positive trend between “alcohol” and the target variable, and that “alcohol” interacts with “residual sugar” frequently.

shap.dependence_plot("alcohol", rf_shap_values, X_test)


3. The individual force plot

You can produce a very elegant plot for each observation called the force plot . I arbitrarily chose the 10th observation of the X_test data. Below are the average values of X_test and the values of the 10th observation (the short sketch after the next paragraph reproduces them).


Pandas uses .iloc[] to subset the rows of a data frame, much like base R does. If you work in other languages, you can read my post “ Are you Bilingual? Be Fluent in R and Python ”, in which I compare the most common data wrangling tasks in R dplyr and Python Pandas.
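
If you want to reproduce those reference values, a short sketch with Pandas looks like this:

print(X_test.mean())      # the average values of the features in X_test
print(X_test.iloc[10, :]) # the feature values of the 10th observation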

# plot the SHAP values for the 10th observation 
shap.force_plot(rf_explainer.expected_value, rf_shap_values[10,:], X_test.iloc[10,:])
  • The output value is the prediction for that observation (the prediction for this observation is 5.11).
  • The base value: The original paper explains that the base value E(y_hat) is “the value that would be predicted if we did not know any features for the current output.” In other words, it is the mean prediction, or mean(yhat), over the background data. You may wonder why it is 5.634: the KernelExplainer sets it to the mean prediction of the model over X_test, which you can verify with rf.predict(X_test).mean().
  • Red/blue : Features that push the prediction higher (to the right) are shown in red, and those pushing the prediction lower are in blue.
  • Alcohol: is positively related to the quality rating. The alcohol of this wine is 9.4, which is lower than the average value of 10.48, so it pushes the prediction to the left.
  • Total sulfur dioxide: is positively related to the quality rating. A higher-than-average sulfur dioxide value (= 18 > 14.98) pushes the prediction to the right.
  • The plot is centered on the x-axis at the explainer's expected_value . All SHAP values are relative to the model's expected value, just as a linear model's effects are relative to the intercept.
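
You can check this additivity yourself: the base value plus the sum of the SHAP values of an observation reproduces its prediction. The sketch below uses the objects created above:

# The base value plus the SHAP values of the 10th observation add up to its prediction
print(rf_explainer.expected_value + rf_shap_values[10, :].sum())
print(rf.predict(X_test.iloc[[10], :]))  # should match the prediction for this observation (5.11 in this example)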

4. The collective force plot

Each observation has its own force plot. If all the force plots are combined, rotated 90 degrees, and stacked horizontally, we get the force plot of the entire data X_test (see the explanation in the GitHub repository of Lundberg and other contributors).

shap.force_plot(rf_explainer.expected_value, rf_shap_values, X_test)


The Y axis of this plot is the X axis of the individual force plot. There are 160 data points in our X_test, so the X axis here has 160 observations.

GBM

I built the GBM with 500 trees (the default is 100), which should be fairly robust against over-fitting. I set aside 20% of the training data for early stopping by using the hyper-parameter validation_fraction=0.2 . This hyper-parameter, together with n_iter_no_change=5 , stops training early if the validation score has not improved for 5 consecutive iterations.

from sklearn import ensemble
n_estimators = 500
gbm = ensemble.GradientBoostingClassifier(
            n_estimators=n_estimators,
            validation_fraction=0.2,
            n_iter_no_change=5, 
            tol=0.01,
            random_state=0)
gbm.fit(X_train, Y_train)
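
Because of the early stopping, the fitted model may use far fewer than 500 trees. You can check how many boosting iterations were actually performed (a one-line sketch):

print(gbm.n_estimators_)  # number of boosting iterations actually fitted after early stopping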

Like the random forest section above, I use the function KernelExplainer() to generate the SHAP values. Then I will provide four plots.

import shap
gbm_explainer = shap.KernelExplainer(gbm.predict, X_test)
gbm_shap_values = gbm_explainer.shap_values(X_test)
  1. The summary plot

When compared with the output of the random forest, GBM shows the same variable ranking for the first four variables but differs for the remaining variables.

shap.summary_plot(gbm_shap_values, X_test)


2. The dependence plot

The dependence plot of GBM also shows that there is an approximately linear and positive trend between “alcohol” and the target variable. In contrast to the output of the random forest, GBM shows that “alcohol” interacts with “density” frequently.

shap.dependence_plot("alcohol", gbm_shap_values, X_test)


3. The individual force plot

I continue to produce the force plot for the 10th observation of the X_test data.

# plot the SHAP values for the 10th observation 
shap.force_plot(gbm_explainer.expected_value,gbm_shap_values[10,:], X_test.iloc[10,:]) 

The prediction of GBM for this observation is 5.00, different from the 5.11 of the random forest. The forces that drive the prediction lower are similar to those of the random forest: alcohol, sulphates, and residual sugar. But the forces that drive the prediction up are different.

4. The collective force plot

shap.force_plot(gbm_explainer.expected_value, gbm_shap_values, X_test)


KNN

Because the goal here is to demonstrate the SHAP values, I simply set the KNN to 15 neighbors without careful model optimization.

# Train the KNN model
from sklearn import neighbors
n_neighbors = 15
knn = neighbors.KNeighborsClassifier(n_neighbors,weights='distance')
knn.fit(X_train, Y_train)

# Produce the SHAP values
knn_explainer = shap.KernelExplainer(knn.predict,X_test)
knn_shap_values = knn_explainer.shap_values(X_test)
  1. The summary plot

Interestingly, the KNN shows a different variable ranking when compared with the output of the random forest or the GBM. This departure is expected because KNN is prone to outliers and here we only train one KNN model. To mitigate the problem, you are advised to build several KNN models with different numbers of neighbors and then average their results, as sketched below. This intuition is also shared in my article “ Anomaly Detection with PyOD ”.
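
A minimal sketch of that averaging idea, with a handful of arbitrary neighbor counts chosen only for illustration, could look like this:

from sklearn import neighbors
neighbor_counts = [5, 10, 15, 20, 25]  # arbitrary choices for illustration
models = [neighbors.KNeighborsClassifier(k, weights='distance').fit(X_train, Y_train) for k in neighbor_counts]
avg_pred = np.mean([m.predict(X_test) for m in models], axis=0)  # averaged predictions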

shap.summary_plot(knn_shap_values, X_test)


2. The dependence plot

The output of the KNN shows that there is an approximately linear and positive trend between “alcohol” and the target variable. Unlike the output of the random forest, the KNN shows that “alcohol” interacts with “total sulfur dioxide” frequently.

shap.dependence_plot("alcohol", knn_shap_values, X_test)


3. The individual force plot

# plot the SHAP values for the 10th observation 
shap.force_plot(knn_explainer.expected_value,knn_shap_values[10,:], X_test.iloc[10,:])

The prediction for this observation is 5.00 which is similar to that of GBM. The driving forces identified by the KNN are: “free sulfur dioxide”, “alcohol” and “residual sugar”.

4. The collective force plot

shap.force_plot(knn_explainer.expected_value, knn_shap_values, X_test)


SVM

A Support Vector Machine (SVM) finds the optimal hyperplane to separate observations into classes. The SVM uses kernel functions to transform the data into a higher-dimensional space for the separation. Why does the separation become easier in a higher-dimensional space? This goes back to the Vapnik-Chervonenkis (VC) theory, which says that mapping into a higher-dimensional space often provides greater classification power. See my post “ Dimension Reduction Techniques with Python ” for further explanation. The common kernel functions are Radial Basis Function (RBF), Gaussian, Polynomial, and Sigmoid.

In this example I use the Radial Basis Function (RBF) kernel with the parameter gamma . When the value of gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. Two options are available: gamma='auto' or gamma='scale' (see the scikit-learn api ).

Another important hyper-parameter is decision_function_shape . The decision function measures how close a data point is to the separating hyperplane; a data point close to the boundary means a low-confidence decision. The hyper-parameter decision_function_shape determines the form of the multi-class decision function, with two options: one-vs-rest (‘ovr’) or one-vs-one (‘ovo’) (see the scikit-learn api ).
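
A quick way to see the difference between the two options is to inspect the shape of the decision function output. The sketch below is for illustration only and uses its own model objects (svm_ovo and svm_ovr), not the model built in the next block:

from sklearn import svm as svm_module
svm_ovo = svm_module.SVC(gamma='scale', decision_function_shape='ovo').fit(X_train, Y_train)
svm_ovr = svm_module.SVC(gamma='scale', decision_function_shape='ovr').fit(X_train, Y_train)
print(svm_ovo.decision_function(X_test).shape)  # (n_samples, n_classes * (n_classes - 1) / 2)
print(svm_ovr.decision_function(X_test).shape)  # (n_samples, n_classes)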

# Build the SVM model
from sklearn import svm
svm = svm.SVC(gamma='scale',decision_function_shape='ovo')
svm.fit(X_train, Y_train)

# The SHAP values
svm_explainer = shap.KernelExplainer(svm.predict,X_test)
svm_shap_values = svm_explainer.shap_values(X_test)
  1. The summary plot

Here again we see a different summary plot from the output of the random forest and GBM. This is expected because we only train one SVM model and SVM is also prone to outliers.

shap.summary_plot(svm_shap_values, X_test)


2. The dependence plot

The output of the SVM shows a mild linear and positive trend between “alcohol” and the target variable. In contrast to the output of the random forest, the SVM shows that “alcohol” interacts with “fixed acidity” frequently.

shap.dependence_plot("alcohol", svm_shap_values, X_test)


3. The individual force plot

# plot the SHAP values for the 10th observation 
shap.force_plot(svm_explainer.expected_value,svm_shap_values[10,:], X_test.iloc[10,:])

The prediction of SVM for this observation is 6.00, different from 5.11 by the random forest. The forces that drive the prediction lower are similar to those of the random forest; in contrast, “total sulfur dioxide” is a strong force to drive the prediction up.

4. The collective force plot

shap.force_plot(svm_explainer.expected_value, svm_shap_values, X_test)


Models Built in Open-Source H2O

Many data scientists (including myself) love the open-source H2O . It is a fully distributed in-memory platform that supports the most widely used algorithms such as the GBM, RF, GLM, DL, and so on. Its AutoML function automatically runs through all the algorithms and their hyper-parameters to produce a leaderboard of the best models. Its enterprise version H2O Driverless AI has built-in SHAP functionality.

How do we apply the SHAP values with open-source H2O? I am indebted to seanPLeary , who has shared with the H2O community how to produce the SHAP values with AutoML. I use his class “H2OProbWrapper” to calculate the SHAP values.

# The code builds a random forest model
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()

X_train, X_test = train_test_split(df, test_size=0.1)
X_train_hex = h2o.H2OFrame(X_train)
X_test_hex = h2o.H2OFrame(X_test)
X_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

# Define the model
h2o_rf = H2ORandomForestEstimator(ntrees=200, max_depth=20, nfolds=10)

# Train the model
h2o_rf.train(x=X_names, y='quality', training_frame=X_train_hex)

X_test = X_test_hex.drop('quality').as_data_frame()

Let’s take a closer look at the SVM code shap.KernelExplainer(svm.predict, X_test) . It takes the function predict of the class svm and the dataset X_test . So when we apply this to H2O, we need to pass (i) a predict function, (ii) a class that provides it, and (iii) a dataset. What’s tricky is that H2O has its own dataframe structure. In order to pass H2O’s predict function to shap.KernelExplainer() , seanPLeary wraps it in a class named H2OProbWrapper . This nice wrapper allows shap.KernelExplainer() to take the function predict_binary_prob of the class H2OProbWrapper and the dataset X_test .

class H2OProbWrapper:
    def __init__(self, h2o_model, feature_names):
        self.h2o_model = h2o_model
        self.feature_names = feature_names

    def predict_binary_prob(self, X):
        if isinstance(X, pd.Series):
            X = X.values.reshape(1, -1)
        self.dataframe = pd.DataFrame(X, columns=self.feature_names)
        self.predictions = self.h2o_model.predict(h2o.H2OFrame(self.dataframe)).as_data_frame().values
        return self.predictions.astype('float64')[:, -1]

So we will compute the SHAP values for the H2O random forest model:

h2o_wrapper = H2OProbWrapper(h2o_rf, X_names)

h2o_rf_explainer = shap.KernelExplainer(h2o_wrapper.predict_binary_prob, X_test)
h2o_rf_shap_values = h2o_rf_explainer.shap_values(X_test)
  1. The summary plot

When compared with the output of the random forest, the H2O random forest shows the same variable ranking for the first three variables.

shap.summary_plot(h2o_rf_shap_values, X_test)


2. The dependence plot

The output shows that there is a linear and positive trend between “alcohol” and the target variable. The H2O Random Forest identifies “alcohol” interacting with “citric acid” frequently.

shap.dependence_plot("alcohol", h2o_rf_shap_values, X_test)


3. The individual force plot

The prediction of the H2O Random Forest for this observation is 6.07. The forces driving the prediction to the right are “alcohol”, “density”, “residual sugar” and “total sulfur dioxide”; to the left are “fixed acidity” and “sulphates”.

# plot the SHAP values for the 10th observation 
shap.force_plot(h2o_rf_explainer.expected_value,h2o_rf_shap_values[10,:], X_test.iloc[10,:])

4. The collective force plot

shap.force_plot(h2o_rf_explainer.expected_value, h2o_rf_shap_values, X_test)


How About SHAP Values in R?

It is interesting to mention a few R packages for the SHAP values here. The R package shapper is a port of the Python library SHAP. The R package xgboost has a built-in SHAP function. Another package is iml (Interpretable Machine Learning). Finally, the R package DALEX (Descriptive mAchine Learning EXplanations) also contains various explainers that help to understand the link between input variables and model output.

