In two of my previous blog posts, I explained how the black box of a random forest can be opened up by tracking decision paths along the trees and computing feature contributions. This way, any prediction can be decomposed into contributions from features, such that \(\text{prediction} = \text{bias} + \text{feature}_1\ \text{contribution} + \ldots + \text{feature}_n\ \text{contribution}\).

However, this linear breakdown is inherently imperfect, since a linear combination of features cannot capture interactions between them. A classic example of a relation where a linear combination of inputs cannot capture the output is exclusive or (XOR), defined as

X1 | X2 | OUT |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |

In this case, neither X1 nor X2 provides anything towards predicting the outcome in isolation. Each value becomes predictive only in conjunction with the other input feature.
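To see why no additive breakdown can represent XOR, here is a short derivation. The symbols \(f\), \(b\), \(c_1\) and \(c_2\) are notation introduced here for illustration: \(f\) is any model of the form bias plus one term per feature.

\[
f(x_1, x_2) = b + c_1(x_1) + c_2(x_2)
\quad\Longrightarrow\quad
f(0,0) + f(1,1) = 2b + c_1(0) + c_1(1) + c_2(0) + c_2(1) = f(0,1) + f(1,0)
\]

For XOR, the left-hand side would have to be \(0 + 0 = 0\) while the right-hand side would have to be \(1 + 1 = 2\), so no choice of \(b\), \(c_1\) and \(c_2\) fits all four rows.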

A decision tree can easily learn a function to classify the XOR data correctly via a two-level tree (depicted below). However, if we consider feature contributions at each node, then at the first step through the tree (when we have looked only at X1), we haven’t yet moved away from the bias, so the best we can predict at that stage is still “don’t know”, i.e. 0.5. If we had to write out the contribution from the feature at the root of the tree, we would (incorrectly) say that it is 0. After the next step down the tree, we would be able to make the correct prediction, at which stage we might say that the second feature provided all the predictive power, since we move from a coin flip (predicting 0.5) to a concrete and correct prediction, either 0 or 1. But attributing this only to the second-level variable in the tree is clearly wrong, since the contribution comes from both features and should be attributed equally to both.
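The problem can be reproduced in a few lines of plain Python. This is a minimal sketch: the nested-dict tree layout and the helper `per_feature_contributions` are hypothetical names made up for illustration, not scikit-learn's or treeinterpreter's internals.

```python
# Hand-built XOR tree (hypothetical structure, not scikit-learn's internal format).
# Each node stores the mean target value of the training rows that reach it.
XOR_TREE = {
    "value": 0.5, "feature": 0,              # root splits on X1
    "children": {
        0: {"value": 0.5, "feature": 1,      # X1 = 0, then split on X2
            "children": {0: {"value": 0.0}, 1: {"value": 1.0}}},
        1: {"value": 0.5, "feature": 1,      # X1 = 1, then split on X2
            "children": {0: {"value": 1.0}, 1: {"value": 0.0}}},
    },
}


def per_feature_contributions(tree, x):
    """Return (prediction, bias, contributions) for one input row x."""
    bias = tree["value"]
    contributions = {}
    node = tree
    while "children" in node:
        feature = node["feature"]
        child = node["children"][x[feature]]
        # Credit the change in node value to the single feature that was split on.
        change = child["value"] - node["value"]
        contributions[feature] = contributions.get(feature, 0.0) + change
        node = child
    return node["value"], bias, contributions


prediction, bias, contributions = per_feature_contributions(XOR_TREE, (0, 1))
print(prediction, bias, contributions)  # 1.0 0.5 {0: 0.0, 1: 0.5}
```

X1 (index 0) is credited with nothing and X2 (index 1) with the full 0.5, even though the prediction depends on both.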

This information is of course available along the tree paths. We simply need to gather together all conditions (and thus features) along the path that leads to a given node.

As you can see, the contribution of the first feature at the root of the tree is 0 (*value* staying at 0.5), while observing the second feature gives the full information needed for the prediction. We can now combine the features along the decision path, and correctly state that X1 and X2 together create the contribution towards the prediction.
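Combining features along the path can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the nested-dict tree and the helper `joint_contributions` are hypothetical, and the rule used here (credit each change in node value to the sorted tuple of all features seen so far on the path) is my reading of the idea described above.

```python
# Hand-built XOR tree (hypothetical structure, not scikit-learn's internal format).
XOR_TREE = {
    "value": 0.5, "feature": 0,              # root splits on X1
    "children": {
        0: {"value": 0.5, "feature": 1,      # X1 = 0, then split on X2
            "children": {0: {"value": 0.0}, 1: {"value": 1.0}}},
        1: {"value": 0.5, "feature": 1,      # X1 = 1, then split on X2
            "children": {0: {"value": 1.0}, 1: {"value": 0.0}}},
    },
}


def joint_contributions(tree, x):
    """Return (prediction, bias, contributions keyed by feature tuples)."""
    bias = tree["value"]
    contributions = {}
    seen = set()  # features used by the conditions on the path so far
    node = tree
    while "children" in node:
        seen.add(node["feature"])
        child = node["children"][x[node["feature"]]]
        # Credit the change in node value to every feature seen so far, jointly.
        key = tuple(sorted(seen))
        change = child["value"] - node["value"]
        contributions[key] = contributions.get(key, 0.0) + change
        node = child
    return node["value"], bias, contributions


for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    prediction, bias, contributions = joint_contributions(XOR_TREE, x)
    # The decomposition still sums to the prediction.
    assert prediction == bias + sum(contributions.values())
    print(x, prediction, contributions)
```

For the row (0, 1) this gives a contribution of 0.0 for `(0,)` and 0.5 for `(0, 1)`: the predictive power is assigned to X1 and X2 together rather than to X2 alone.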

The joint contribution calculation is supported by v0.2 of the treeinterpreter package (clone or install via pip). Joint contributions can be obtained by passing the *joint_contribution* argument to the *predict* method, which returns the triple [prediction, bias, contributions], where contributions holds, for each predicted row, a mapping from tuples of feature indices to absolute contributions.

Here’s an example, comparing two subsets of the Boston housing data and calculating which feature combinations contribute to the difference in estimated prices:

    from sklearn.ensemble import RandomForestRegressor
    import numpy as np
    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release.
    from sklearn.datasets import load_boston
    from treeinterpreter import treeinterpreter as ti, utils

    boston = load_boston()
    rf = RandomForestRegressor()
    # We train a random forest model, ...
    rf.fit(boston.data[:300], boston.target[:300])
    # take two subsets from the data ...
    ds1 = boston.data[300:400]
    ds2 = boston.data[400:]
    # and check what the predicted average price is
    print(np.mean(rf.predict(ds1)))
    print(np.mean(rf.predict(ds2)))

Output:

    21.9329
    17.8863207547

The average predicted price is different for the two datasets. We can break down why and check the joint feature contribution for both datasets.

    prediction1, bias1, contributions1 = ti.predict(rf, ds1, joint_contribution=True)
    prediction2, bias2, contributions2 = ti.predict(rf, ds2, joint_contribution=True)

Since the biases are equal for both datasets (because the model is the same), the difference between the average predicted values has to come only from the (joint) feature contributions. In other words, the sum of the feature contribution differences should be equal to the difference in average prediction.
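Written out, the argument looks like this. The notation is introduced here for illustration: \(b\) is the shared bias, \(S\) ranges over feature combinations, \(c_S(x)\) is the joint contribution of \(S\) for row \(x\) (taken as 0 when \(S\) does not occur for that row), and \(\bar{c}_S^{(i)}\) is its average over dataset \(D_i\).

\[
\hat{y}(x) = b + \sum_{S} c_S(x)
\quad\Longrightarrow\quad
\frac{1}{|D_1|}\sum_{x \in D_1} \hat{y}(x) - \frac{1}{|D_2|}\sum_{x \in D_2} \hat{y}(x)
= \sum_{S} \left( \bar{c}_S^{(1)} - \bar{c}_S^{(2)} \right)
\]

The bias \(b\) appears once in each average and cancels in the difference.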

We can make use of the *aggregated_contribution* convenience function in *utils*, which takes the contributions for individual predictions and aggregates them for the whole dataset:

    aggregated_contributions1 = utils.aggregated_contribution(contributions1)
    aggregated_contributions2 = utils.aggregated_contribution(contributions2)

    print(np.sum(list(aggregated_contributions1.values())) -
          np.sum(list(aggregated_contributions2.values())))
    print(np.mean(prediction1) - np.mean(prediction2))

Output:

    4.04657924528
    4.04657924528

Indeed we see that the contributions exactly match the difference, as they should.
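For intuition about the aggregation step, here is a minimal sketch of averaging per-row joint contributions over a dataset. The helper `average_contributions` and the numbers are made up for illustration; this is not the package's implementation, only one plausible way to do it, where a feature combination that is absent from a row counts as a zero contribution for that row.

```python
def average_contributions(per_row_contributions):
    """Average a list of {feature-tuple: contribution} dicts over all rows."""
    n_rows = len(per_row_contributions)
    totals = {}
    for row in per_row_contributions:
        for key, value in row.items():
            totals[key] = totals.get(key, 0.0) + value
    # Dividing by the total row count treats missing combinations as zeros.
    return {key: total / n_rows for key, total in totals.items()}


# Toy per-row contributions for two small datasets (illustrative values only).
rows_a = [{(0,): 1.0, (0, 1): 2.0}, {(0,): 3.0}]
rows_b = [{(0,): 0.5}, {(1,): 1.5}]

avg_a = average_contributions(rows_a)
avg_b = average_contributions(rows_b)

# Difference per feature combination, largest absolute difference first.
delta = {k: avg_a.get(k, 0.0) - avg_b.get(k, 0.0) for k in set(avg_a) | set(avg_b)}
for key, value in sorted(delta.items(), key=lambda kv: -abs(kv[1])):
    print(key, value)
```

Here the averages are `{(0,): 2.0, (0, 1): 1.0}` and `{(0,): 0.25, (1,): 0.75}`, so the per-combination differences are 1.75, 1.0 and -0.75, and they sum to the difference between the two datasets' total average contributions.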

Finally, we can check which feature combinations contributed how much to the difference between the predictions for the two datasets:

    res = []
    for k in set(aggregated_contributions1.keys()).union(
            set(aggregated_contributions2.keys())):
        res.append(([boston["feature_names"][index] for index in k],
                    aggregated_contributions1.get(k, 0) - aggregated_contributions2.get(k, 0)))

    for lst, v in (sorted(res, key=lambda x: -abs(x[1])))[:10]:
        print(lst, v)

Output:

    (['RM', 'LSTAT'], 2.0317570671740883)
    (['RM'], 0.69252072064203141)
    (['CRIM', 'RM', 'LSTAT'], 0.37069750747155134)
    (['RM', 'AGE'], 0.11572468903150034)
    (['INDUS', 'RM', 'AGE', 'LSTAT'], 0.054158313631716165)
    (['CRIM', 'RM', 'AGE', 'LSTAT'], -0.030778806073267474)
    (['CRIM', 'RM', 'PTRATIO', 'LSTAT'], 0.022935961564662693)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'TAX', 'LSTAT'], 0.022200426774483421)
    (['CRIM', 'RM', 'DIS', 'LSTAT'], 0.016906509656987388)
    (['CRIM', 'INDUS', 'RM', 'AGE', 'LSTAT'], -0.016840238405056267)

The majority of the delta came from the feature for number of rooms (RM), in conjunction with demographics data (LSTAT).

## Summary

Making random forest predictions interpretable is pretty straightforward, leading to a similar level of interpretability as linear models. However, in some cases, tracking the feature interactions can be important, in which case representing the results as a linear combination of features can be misleading. By using the *joint_contribution* keyword for prediction in the treeinterpreter package, one can easily take feature interactions into account when breaking down the contributions.

Excellent library and series of posts, I’m looking at this library in my recent work.

I have a question:

We know that typical random forest measures of variable importance suffer under correlated variables and because of that typical variable importance measures don’t really generalize nicely, especially as compared to linear model coefficients.

However, if correlated variable importance is considered using conditional importance then the variable importance reflects a more accurate picture of what’s going on.

My question is this (and probably obvious, I apologize): in terms of interpretability, can `treeinterpreter`, with joint_contributions, reflect variable importance through variable contribution to the learning problem without bias or undue distortion; are contributions in this case really as interpretable and analogous to coefficients in linear regression?

Not sure what you mean by not generalizing nicely when compared to linear model coefficients. Under correlated variables, linear model coefficients are notoriously difficult to interpret; see http://blog.datadive.net/selecting-good-features-part-ii-linear-models-and-regularization/ for example.

Sorry, my last sentence wasn’t clear. To clarify, “… are contributions in this case really as interpretable and analogous [or more so] to coefficients in linear regression?”

In short, say someone was trying to rank order model variables in terms of importance. For linear regression they might just rank order variable coefficients. For random forests, my question is: can one just rank order variable contributions as a proxy for variable importance?

Variable importance in this context is about the model itself: which features in general/on average tend to contribute to the prediction the most.

Feature contributions already take into account both the model and the test data, telling you exactly how much each feature contributes given (a) particular data point(s). So feature contribution can indeed be thought of as feature importance for a given test data.

Great, I appreciate the followup and explanation, thanks!