In two of my previous blog posts, I explained how the black box of a random forest can be opened up by tracking decision paths along the trees and computing feature contributions. This way, any prediction can be decomposed into contributions from features, such that \(\text{prediction} = \text{bias} + \text{feature}_1\,\text{contribution} + \ldots + \text{feature}_n\,\text{contribution}\).

However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them. A classic example of a relation where a linear combination of inputs cannot capture the output is exclusive or (XOR), defined as

X1 | X2 | OUT |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |

In this case, neither X1 nor X2 provides anything towards predicting the outcome in isolation. Each value only becomes predictive in conjunction with the other input feature.
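To make this concrete, here is a small sketch in plain Python (the names `xor_rows` and `mean_out` are mine, chosen for illustration). It conditions on one feature at a time and shows that the average outcome stays at 0.5, while conditioning on both features determines the outcome exactly.

```python
# XOR truth table as (x1, x2, out) rows.
xor_rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def mean_out(rows):
    """Average of the OUT column over the given rows."""
    return sum(out for _, _, out in rows) / len(rows)

# Conditioning on a single feature leaves the average outcome at 0.5.
for name, idx in (("X1", 0), ("X2", 1)):
    for value in (0, 1):
        subset = [row for row in xor_rows if row[idx] == value]
        print(f"mean(OUT | {name}={value}) = {mean_out(subset)}")

# Conditioning on both features pins the outcome down exactly.
for x1, x2, out in xor_rows:
    print(f"OUT | X1={x1}, X2={x2} = {out}")
```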

A decision tree can easily learn a function to classify the XOR data correctly via a two-level tree (depicted below). However, if we consider feature contributions at each node, then at the first step through the tree (when we have looked only at X1), we haven’t yet moved away from the bias, so the best we can predict at that stage is still “don’t know”, i.e. 0.5. If we had to write out the contribution from the feature at the root of the tree, we would (incorrectly) say that it is 0. After the next step down the tree, we would be able to make the correct prediction, at which stage we might say that the second feature provided all the predictive power, since we can move from a coin flip (predicting 0.5) to a concrete and correct prediction, either 0 or 1. But attributing this only to the second-level variable in the tree is clearly wrong, since the contribution comes from both features and should be attributed equally to both.
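The per-feature bookkeeping described above can be sketched on a hand-built XOR tree. This is an illustration only: the nested-dict tree and the function `per_feature_contributions` are hypothetical stand-ins, not the treeinterpreter implementation. Each node stores the mean outcome of the rows that reach it, and every change in that value is credited to the feature split on at that step.

```python
def leaf(value):
    return {"value": value}

def split(feature, value, if_zero, if_one):
    # "value" is the mean outcome of the training rows reaching this node.
    return {"value": value, "feature": feature,
            "children": {0: if_zero, 1: if_one}}

# Two-level tree that classifies XOR exactly.
xor_tree = split("X1", 0.5,
                 split("X2", 0.5, leaf(0.0), leaf(1.0)),
                 split("X2", 0.5, leaf(1.0), leaf(0.0)))

def per_feature_contributions(tree, sample):
    """Walk the decision path and credit each change in node value
    to the single feature that was split on at that step."""
    bias = tree["value"]
    contributions = {}
    node = tree
    while "feature" in node:
        feature = node["feature"]
        child = node["children"][sample[feature]]
        delta = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + delta
        node = child
    return node["value"], bias, contributions

prediction, bias, contributions = per_feature_contributions(
    xor_tree, {"X1": 1, "X2": 0})
print(prediction, bias, contributions)
# → 1.0 0.5 {'X1': 0.0, 'X2': 0.5}
```

The decomposition still adds up (0.5 + 0.0 + 0.5 = 1.0), but all of the credit lands on X2, which is exactly the misattribution discussed above.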

This information is of course available along the tree paths. We simply need to gather all the conditions (and thus features) along the path that leads to a given node.

As you can see, the contribution of the first feature at the root of the tree is 0 (*value* staying at 0.5), while observing the second feature gives the full information needed for the prediction. We can now combine the features along the decision path, and correctly state that X1 and X2 together create the contribution towards the prediction.
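Here is a minimal sketch of that idea on a hand-built XOR tree. It assumes that each change in node value is credited to the set of all features seen so far on the path. The tree representation and the function `joint_contributions` are hypothetical names used for illustration, so treat this as a reading of the description above and not as the package's actual code.

```python
def leaf(value):
    return {"value": value}

def split(feature, value, if_zero, if_one):
    # "value" is the mean outcome of the training rows reaching this node.
    return {"value": value, "feature": feature,
            "children": {0: if_zero, 1: if_one}}

# Two-level tree that classifies XOR exactly.
xor_tree = split("X1", 0.5,
                 split("X2", 0.5, leaf(0.0), leaf(1.0)),
                 split("X2", 0.5, leaf(1.0), leaf(0.0)))

def joint_contributions(tree, sample):
    """Credit each change in node value to the tuple of all features
    encountered on the path up to and including the current split."""
    bias = tree["value"]
    contributions = {}
    seen = set()
    node = tree
    while "feature" in node:
        seen.add(node["feature"])
        child = node["children"][sample[node["feature"]]]
        key = tuple(sorted(seen))
        delta = child["value"] - node["value"]
        contributions[key] = contributions.get(key, 0.0) + delta
        node = child
    return node["value"], bias, contributions

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), joint_contributions(xor_tree, {"X1": x1, "X2": x2}))
```

For every input, the key `('X1',)` carries a contribution of 0 and the key `('X1', 'X2')` carries the full ±0.5, so the credit goes to the pair and not to X2 alone.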

The joint contribution calculation is supported by v0.2 of the treeinterpreter package (clone or install via pip). Joint contributions can be obtained by passing the *joint_contribution* argument to the *predict* method, which returns the triple [prediction, bias, contributions], where contributions maps tuples of feature indices to absolute contributions.

Here’s an example, comparing two subsets of the Boston housing data and calculating which feature combinations contribute to the difference in estimated prices:

    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
    from sklearn.datasets import load_boston
    from treeinterpreter import treeinterpreter as ti, utils

    boston = load_boston()
    rf = RandomForestRegressor()
    # We train a random forest model, ...
    rf.fit(boston.data[:300], boston.target[:300])
    # take two subsets from the data ...
    ds1 = boston.data[300:400]
    ds2 = boston.data[400:]
    # and check what the predicted average price is
    print(np.mean(rf.predict(ds1)))
    print(np.mean(rf.predict(ds2)))

    21.9329
    17.8863207547

The average predicted price is different for the two datasets. We can break down why and check the joint feature contribution for both datasets.

    prediction1, bias1, contributions1 = ti.predict(rf, ds1, joint_contribution=True)
    prediction2, bias2, contributions2 = ti.predict(rf, ds2, joint_contribution=True)

Since biases are equal for both datasets (because the model is the same), the difference between the average predicted values has to come only from (joint) feature contributions. In other words, the sum of the feature contribution differences should be equal to the difference in average prediction.
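Written out, with notation that is mine and not the package's: let \(b\) be the shared bias, and let \(c_k^{(1)}\) and \(c_k^{(2)}\) be the average contribution of feature combination \(k\) over the first and second dataset, with average predictions \(\bar{y}^{(1)}\) and \(\bar{y}^{(2)}\). Then the bias cancels:

\[
\bar{y}^{(1)} - \bar{y}^{(2)}
= \Big(b + \sum_k c_k^{(1)}\Big) - \Big(b + \sum_k c_k^{(2)}\Big)
= \sum_k \big(c_k^{(1)} - c_k^{(2)}\big).
\]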

We can make use of the *aggregated_contribution* convenience function, which takes the contributions for individual predictions and aggregates them together for the whole dataset:
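As a rough picture of what such an aggregation does, here is a sketch under two assumptions of mine: the contributions arrive as one dictionary per sample, and aggregation means a per-key average over samples, with a missing key counting as 0. The function `aggregate_mean` is a hypothetical stand-in and not the library function itself.

```python
def aggregate_mean(contribution_maps):
    """Average a list of per-sample {feature-tuple: contribution} dicts.
    A combination that is absent for a sample counts as 0 for that sample."""
    totals = {}
    for sample_map in contribution_maps:
        for key, value in sample_map.items():
            totals[key] = totals.get(key, 0.0) + value
    n = len(contribution_maps)
    return {key: total / n for key, total in totals.items()}

# Two made-up samples with partially overlapping feature combinations.
per_sample = [
    {(0,): 1.0, (0, 1): 2.0},
    {(0,): 3.0, (1, 2): -4.0},
]
aggregated = aggregate_mean(per_sample)
print(aggregated)
# → {(0,): 2.0, (0, 1): 1.0, (1, 2): -2.0}
```

Averaging is the natural choice here, because the sum of the averaged contributions plus the bias then equals the average prediction.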

    aggregated_contributions1 = utils.aggregated_contribution(contributions1)
    aggregated_contributions2 = utils.aggregated_contribution(contributions2)
    print(np.sum(list(aggregated_contributions1.values())) -
          np.sum(list(aggregated_contributions2.values())))
    print(np.mean(prediction1) - np.mean(prediction2))

    4.04657924528
    4.04657924528

Indeed we see that the contributions exactly match the difference, as they should.

Finally, we can check which feature combination contributed by how much to the difference of the predictions in the two datasets:

    res = []
    for k in set(aggregated_contributions1.keys()).union(
            set(aggregated_contributions2.keys())):
        res.append(([boston["feature_names"][index] for index in k],
                    aggregated_contributions1.get(k, 0) - aggregated_contributions2.get(k, 0)))

    for lst, v in (sorted(res, key=lambda x: -abs(x[1])))[:10]:
        print(lst, v)

    (['RM', 'LSTAT'], 2.0317570671740883)
    (['RM'], 0.69252072064203141)
    (['CRIM', 'RM', 'LSTAT'], 0.37069750747155134)
    (['RM', 'AGE'], 0.11572468903150034)
    (['INDUS', 'RM', 'AGE', 'LSTAT'], 0.054158313631716165)
    (['CRIM', 'RM', 'AGE', 'LSTAT'], -0.030778806073267474)
    (['CRIM', 'RM', 'PTRATIO', 'LSTAT'], 0.022935961564662693)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'TAX', 'LSTAT'], 0.022200426774483421)
    (['CRIM', 'RM', 'DIS', 'LSTAT'], 0.016906509656987388)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'LSTAT'], -0.016840238405056267)

The majority of the delta came from the feature for number of rooms (RM), in conjunction with demographics data (LSTAT).

## Summary

Making random forest predictions interpretable is pretty straightforward, leading to a similar level of interpretability as linear models. However, in some cases, tracking the feature interactions can be important, in which case representing the results as a linear combination of features can be misleading. By using the *joint_contribution* keyword for prediction in the treeinterpreter package, one can trivially take into account feature interactions when breaking down the contributions.

Excellent library and series of posts, I’m looking at this library in my recent work.

I have a question:

We know that typical random forest measures of variable importance suffer under correlated variables and because of that typical variable importance measures don’t really generalize nicely, especially as compared to linear model coefficients.

However, if correlated variable importance is considered using conditional importance then the variable importance reflects a more accurate picture of what’s going on.

My question is this (and probably obvious, I apologize): in terms of interpretability, can `treeinterpreter`, with joint_contributions, reflect variable importance through variable contribution to the learning problem without bias or undue distortion; are contributions in this case really as interpretable and analogous to coefficients in linear regression?

Not sure what you mean by not generalizing nicely when compared to linear model coefficients. Under correlated variables, linear model coefficients are notoriously difficult to interpret, see http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/ for example.

Sorry, my last sentence wasn’t clear. To clarify, “… are contributions in this case really as interpretable and analogous [or more so] to coefficients in linear regression?”

In short, say someone was trying to rank order model variables in terms of importance. For linear regression they might just rank order variable coefficients. For random forests, my question is: can one just rank order variable contributions as a proxy for variable importance?

Variable importance in this context is about the model itself: which features in general/on average tend to contribute to the prediction the most.

Feature contributions already take into account both the model and the test data, telling you exactly how much each feature contributes given (a) particular data point(s). So feature contribution can indeed be thought of as feature importance for a given test data.

Great, I appreciate the followup and explanation, thanks!

Hi there! Thanks for this really cool blog post with excellent illustrations of the topic. I was wondering whether the size of the contribution value depends on the values of the features similar to coefficients in linear regression. Can I just go by the absolute contribution value that treeinterpreter gives me to sort the features by contribution? Thanks in advance!

I’m not a coder, I have trouble when i do it. Should I hire a coder?

Thank you for this library and clear explanation !

I have some questions about joint contribution:

– how exactly is it calculated ?

– If the contribution of x1 is 0.05 and that of x2 is 0.001, for example, and their joint contribution (x1, x2) is 0.12, does it mean that these two variables interact with each other?

Thank you in advance !

An excellent series of posts in your library indeed. Even understandable to me, and I am a precision engineer!

Hello

Firstly, many thanks to the author, who helped me interpret the mechanism of RF!

I found something that confuses me while applying “treeinterpreter” to regression:

prediction=bias+feature1contribution+..+featurencontribution.

the bias, known as the mean value of the training set, is calculated in the treeinterpreter like this:

biases = np.full(X.shape[0], values[paths[0][0]])

However, in my several trials, this bias is slightly different from the real mean value of the training set.

Have you ever met this problem? Should I modify the calculation of the bias by hand, e.g. by adding some compensation like: biases = np.full(X.shape[0], values[paths[0][0]]+p), where p is some value that makes the bias equal to the real mean value?

Thank you in advance.

Hi

If you use bootstrap in your random forest (which you do by default), then indeed the bias doesn’t necessarily exactly match original trainset mean, because the bootstrap sample set for each tree will have slightly different means (and in the end they are aggregated).

In other words, bias is the mean of “real” training set of the tree as it is trained in scikit-learn, which isn’t necessarily exactly the same as the mean for the original training set, due to bootstrap.

The given bias shouldn’t be adjusted, it is in fact the correct one for the given model.
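A minimal sketch of this effect, using only the standard library and not scikit-learn itself: the targets are made-up numbers and `bootstrap_mean` is a hypothetical helper, so this only mimics the mechanism. Each “tree” sees a resample drawn with replacement, its root value is the mean of that resample, and the forest bias is the average of those root values.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical training targets; any numeric list would do.
targets = [float(v) for v in range(1, 101)]

def bootstrap_mean(values):
    """Mean of one bootstrap resample (same size, drawn with replacement)."""
    resample = random.choices(values, k=len(values))
    return mean(resample)

# One root-node value per tree, then averaged over the forest.
tree_biases = [bootstrap_mean(targets) for _ in range(10)]
forest_bias = mean(tree_biases)

print("training mean:", mean(targets))
print("forest bias:  ", forest_bias)
```

The two printed numbers will typically be close but not identical, which is the small gap between the bias and the mean of the original training set.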

Hi,

Thanks for all the great work.

I noted that ExtraTreesClassifier models will work in the readme, but when one is supplied, it triggers the value error looking for a DTclassifier or DT regressor.

Are the ExtraTreesClassifier models not yet supported?

Thanks!

Thanks for sharing it.

An insightful and easy to understand summary. Great for practical learners. Thanks!

Based on your discussion, it seems like for a tree of depth n, the number of interaction terms scales as n factorial. For example, for the path 1->2->3 through the tree, (1,2), (2,3) and (1,2,3) are interactions. For the path 1->2->3->4, (1,2), (2,3), (3,4), (1,2,3), (2,3,4) and (1,2,3,4) are interactions. Furthermore, the interactions should nest, i.e. (1,2) is nested in (1,2,3), which is nested in (1,2,3,4).

1. Would there be a way to specify computing up to pairwise or triplets of interactions in treeinterpreter?

2. I tried running treeinterpreter with joint_contributions = True, and for each instance, the interaction terms did not nest, i.e. are not hierarchical and contain many non-overlapping members. How is this possible, if this is a single instance going through the tree?

Hi

Greeting and Regards

First of all, thanks for the lesson and the explanation.

In your data set, you have some samples, and each sample contains a number of attributes. For example, we have 100 samples and each sample contains 30 attributes.

My question is whether we can use this algorithm for a data set that has 100 samples with 30 attributes, where each feature has three parts,

i.e. we have a population of samples in which each sample contains 56 features and each feature contains 3 parts.
