Interpreting random forests

Posted October 19, 2014

Why model interpretation?

Imagine a situation where a credit card company has built a fraud detection model using a random forest. The model can classify every transaction as either valid or fraudulent, based on a large number of features. What if, after a transaction is classified as fraudulent, the analyst would like to know why the model made this decision, i.e. how much each feature contributed to the final outcome?

Or what if a random forest model that worked as expected on an old data set is producing unexpected results on a new data set? How would one check which features contribute most to the change in behaviour?

Random forest as a black box

Most literature on random forests and interpretable models would lead you to believe this is nigh impossible, since random forests are typically treated as a black box. Indeed, a forest consists of a large number of deep trees, where each tree is trained on bagged data using random selection of features, so gaining a full understanding of the decision process by examining each individual tree is infeasible. Furthermore, even if we were to examine just a single tree, this is feasible only when the tree is shallow and uses few features. A tree of depth 10 can already have up to \(2^{11} - 1 = 2047\) nodes, meaning that using it as an explanatory model is almost impossible.

One way of gaining insight into a random forest is to compute feature importances, either by permuting the values of each feature one by one and checking how this changes the model's performance, or by computing the amount of “impurity” (typically variance for regression trees, and Gini impurity or entropy for classification trees) that each feature removes when it is used in a node. Both approaches are useful, but they are crude and static in the sense that they give little insight into individual decisions on actual data.
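
To make the permutation idea concrete, here is a minimal standard-library sketch. The `model` function is a hypothetical stand-in for a trained forest (it is not a real random forest), the data is synthetic, and `permuted_error_increase` is an illustrative helper name rather than a library function. It shuffles one feature column at a time and reports how much the mean squared error grows.

```python
import random


def model(row):
    # Hypothetical stand-in for a trained model: it uses features 0 and 1
    # and ignores feature 2 entirely.
    return 3.0 * row[0] + (1.0 if row[1] > 0.5 else 0.0)


def mse(predict, X, y):
    return sum((predict(r) - t) ** 2 for r, t in zip(X, y)) / len(X)


def permuted_error_increase(predict, X, y, feature, rng):
    """Increase in mean squared error after shuffling one feature column."""
    baseline = mse(predict, X, y)
    column = [row[feature] for row in X]
    rng.shuffle(column)
    X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
    return mse(predict, X_perm, y) - baseline


rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(200)]
y = [model(row) for row in X]  # noise-free targets, so the baseline error is 0

importances = [permuted_error_increase(model, X, y, k, rng) for k in range(3)]
print(importances)
```

The ignored feature gets an importance of exactly zero, while the two features the model uses get positive scores. Note that this is one number per feature for the whole data set, which is exactly the "static" limitation described above.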

Turning a black box into a white box: decision paths

When considering a decision tree, it is intuitively clear that for each decision that a tree (or a forest) makes there is a path (or paths) from the root of the tree to a leaf, consisting of a series of decisions, each guarded by a particular feature and each contributing to the final prediction.
A decision tree with \(M\) leaves divides the feature space into \(M\) regions \(R_m, 1\leq m \leq M \). In the classical definition (see e.g. Elements of Statistical Learning), the prediction function of a tree is then defined as \(f(x) = \sum\limits_{m=1}^M c_m I(x, R_m)\) where \(M\) is the number of leaves in the tree (i.e. regions in the feature space), \(R_m\) is a region in the feature space (corresponding to leaf \(m\)), \(c_m\) is a constant corresponding to region \(m\), and \(I\) is the indicator function (returning 1 if \(x \in R_m\), 0 otherwise). The value of \(c_m\) is determined in the training phase of the tree: for regression trees it is the mean of the response variable over the training samples that belong to region \(R_m\) (for classification trees, the class ratios). The definition is concise and captures the meaning of a tree: the decision function returns the value at the correct leaf of the tree. But it ignores the “operational” side of the decision tree, namely the path through the decision nodes and the information that is available there.
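
The classical definition can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the three regions, their thresholds on RM and LSTAT, and the six training samples are invented and do not come from the actual Boston data. Each \(c_m\) is computed as the mean response of the training samples falling into \(R_m\), and \(f(x)\) is the indicator-weighted sum.

```python
from statistics import mean

# Invented training samples: (features, response).
train = [
    ({"RM": 6.0, "LSTAT": 10.0}, 22.0),
    ({"RM": 6.5, "LSTAT": 12.0}, 24.0),
    ({"RM": 6.2, "LSTAT": 18.0}, 14.0),
    ({"RM": 5.8, "LSTAT": 20.0}, 16.0),
    ({"RM": 7.2, "LSTAT": 5.0}, 32.0),
    ({"RM": 7.8, "LSTAT": 4.0}, 45.0),
]

# Hypothetical regions R_1..R_3 of a small tree, written as membership tests.
regions = [
    lambda x: x["RM"] <= 6.9 and x["LSTAT"] <= 14.4,
    lambda x: x["RM"] <= 6.9 and x["LSTAT"] > 14.4,
    lambda x: x["RM"] > 6.9,
]

# c_m: mean response of the training samples that fall into region R_m.
c = [mean(y for x, y in train if in_region(x)) for in_region in regions]


def f(x):
    # f(x) = sum_m c_m * I(x, R_m); exactly one indicator is 1 for any x.
    return sum(c_m * (1 if in_region(x) else 0) for c_m, in_region in zip(c, regions))


print(c)                                # leaf constants
print(f({"RM": 6.5, "LSTAT": 16.1}))    # value of the matching leaf
```

The function only ever reports the leaf value; nothing about how the sample reached that leaf survives, which is the gap the decision-path view fills.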

Example: Boston housing data

Let’s take the Boston housing price data set, which includes housing prices in suburbs of Boston together with a number of key attributes such as air quality (NOX variable below), distance from the city center (DIST) and a number of others – check the page for the full description of the dataset and the features. We’ll build a regression decision tree (of depth 3 to keep things readable) to predict housing prices. As usual, the tree has conditions on each internal node and a value associated with each leaf (i.e. the value to be predicted). But additionally we’ve plotted out the value at each internal node i.e. the mean of the response variables in that region.

RM	LSTAT	NOX	DIST
3.1	4.5	0.54	2.6	Predict
6.5	16.1	0.12	2.2	Predict
7.1	10.5	0.31	1.8	Predict

You can hover on the leaves of the tree or click “predict” in the table (which includes sample values from the data set) to see the decision paths that lead to each prediction.
What’s novel here is that you can see the breakdown of the prediction, written down in terms of value changes along the prediction path, together with feature names that “caused” every value change due to being in the guard (the numbers are approximate due to rounding).

What this example should make apparent is that there is another, more “operational” way to define the prediction, namely through the sequence of regions that correspond to each node/decision in the tree. Since each decision is guarded by a feature, and the decision either adds to or subtracts from the value given in the parent node, the prediction can be defined as the sum of the feature contributions plus the “bias” (i.e. the mean given by the topmost region, which covers the entire training set).
Without writing out the full derivation, the prediction function can be written down as \(f(x) = c_{full} + \sum\limits_{k=1}^K contrib(x, k)\) where \(K\) is the number of features, \(c_{full}\) is the value at the root of the tree and \(contrib(x, k)\) is the contribution from the k-th feature in the feature vector \(x\). This is superficially similar to linear regression (\(f(x) = a + bx\)). For linear regression the coefficients \(b\) are fixed, with a single constant for every feature that determines the contribution. For the decision tree, the contribution of each feature is not a single predetermined value, but depends on the rest of the feature vector, which determines the decision path that traverses the tree and thus the guards/contributions that are passed along the way.
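
As a rough illustration of this decomposition, the sketch below walks a hand-built toy tree and attributes each change in node value to the feature guarding the split. The `Node` class, the `decompose` helper, and all thresholds and node values are hypothetical and chosen for readability; this is not how scikit-learn stores its trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float                    # mean response of the training samples in this node's region
    feature: Optional[str] = None   # guard feature; None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # taken when x[feature] <= threshold
    right: Optional["Node"] = None  # taken otherwise


def decompose(root, x):
    """Return (prediction, bias, contributions) for one sample."""
    contributions = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        # The value change at this step is credited to the guarding feature.
        contributions[node.feature] = contributions.get(node.feature, 0.0) + (child.value - node.value)
        node = child
    return node.value, root.value, contributions


# Toy tree with made-up numbers; the root value plays the role of c_full.
tree = Node(25.5, "RM", 6.9,
            left=Node(19.0, "LSTAT", 14.4, left=Node(23.0), right=Node(15.0)),
            right=Node(38.5))

prediction, bias, contributions = decompose(tree, {"RM": 6.5, "LSTAT": 16.1})
print(prediction, bias, contributions)
```

For this sample the path is root, then RM <= 6.9, then LSTAT > 14.4, so the prediction 15.0 splits into the bias 25.5, a contribution of -6.5 from RM and -4.0 from LSTAT. A sample taking the right branch at the root would get no LSTAT contribution at all, which is the path dependence described above.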

From decision trees to forest

We started the discussion with random forests, so how do we move from a decision tree to a forest? This is straightforward, since the prediction of a forest is the average of the predictions of its trees: \(F(x) = \frac{1}{J} \sum\limits_{j=1}^J f_j(x) \), where \(J\) is the number of trees in the forest. From this, it is easy to see that for a forest, the prediction is simply the average of the bias terms plus the average contribution of each feature: \(F(x) = \frac{1}{J}{\sum\limits_{j=1}^J {c_{j}}_{full}} + \sum\limits_{k=1}^K (\frac{1}{J}\sum\limits_{j=1}^J contrib_j(x, k)) \).
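
The averaging step can be checked numerically on a toy forest. The sketch below is self-contained and uses two hand-built trees with invented values; `Node` and `decompose` are hypothetical helpers, not part of any library. It averages the per-tree biases and per-feature contributions and compares the result with the averaged tree predictions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    value: float
    feature: Optional[str] = None   # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None   # taken when x[feature] <= threshold
    right: Optional["Node"] = None


def decompose(root, x):
    """Return (prediction, bias, contributions) for one tree and one sample."""
    contributions = {}
    node = root
    while node.feature is not None:
        child = node.left if x[node.feature] <= node.threshold else node.right
        contributions[node.feature] = contributions.get(node.feature, 0.0) + (child.value - node.value)
        node = child
    return node.value, root.value, contributions


# Two toy trees with made-up values; bagged trees generally differ in root value and splits.
tree_1 = Node(25.5, "RM", 6.9,
              left=Node(19.0, "LSTAT", 14.4, left=Node(23.0), right=Node(15.0)),
              right=Node(38.5))
tree_2 = Node(26.0, "LSTAT", 10.0,
              left=Node(36.0),
              right=Node(18.0, "RM", 6.0, left=Node(14.0), right=Node(20.0)))
forest = [tree_1, tree_2]
features = ["RM", "LSTAT"]

x = {"RM": 6.5, "LSTAT": 16.1}
parts = [decompose(tree, x) for tree in forest]
J = len(forest)

forest_prediction = sum(p for p, _, _ in parts) / J   # F(x): average of tree predictions
forest_bias = sum(b for _, b, _ in parts) / J         # average of the c_full terms
forest_contrib = {k: sum(c.get(k, 0.0) for _, _, c in parts) / J for k in features}

print(forest_prediction, forest_bias, forest_contrib)
```

Here the averaged bias plus the averaged contributions reproduces the forest prediction exactly, because the decomposition holds for each tree and averaging is linear.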

Running the interpreter

Update (Aug 12, 2015)
Running the interpretation algorithm on an actual random forest model and data is straightforward using the treeinterpreter library (pip install treeinterpreter), which can decompose the predictions of scikit-learn's decision tree and random forest models. More information and examples are available in this blog post.

Summary

There is a very straightforward way to make random forest predictions more interpretable, giving a level of interpretability similar to that of linear models, in a dynamic rather than a static sense. Every prediction can be trivially presented as a sum of feature contributions, showing how the features lead to a particular prediction. This opens up a lot of opportunities in practical machine learning and data science tasks:

Explain to an analyst why a particular prediction is made
Debug models when results are unexpected
Explain the differences between two datasets (for example, behavior before and after treatment) by comparing their average predictions and corresponding average feature contributions
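
The dataset comparison in the last item can be sketched as follows. The per-sample decompositions below are invented numbers standing in for the output of a decomposition routine run on the same model (so the bias is shared), and `summarize` is a hypothetical helper. Because the bias cancels, the shift in average prediction equals the sum of the shifts in average feature contributions.

```python
from statistics import mean

FEATURES = ["RM", "LSTAT"]

# Hypothetical (prediction, contributions) pairs for two datasets scored by the
# same model. A feature that is absent from a sample's path contributes 0.
before = [
    (15.0, {"RM": -6.5, "LSTAT": -4.0}),
    (23.0, {"RM": -6.5, "LSTAT": 4.0}),
]
after = [
    (38.5, {"RM": 13.0}),
    (23.0, {"RM": -6.5, "LSTAT": 4.0}),
]


def summarize(rows):
    """Average prediction and average per-feature contribution of a dataset."""
    avg_prediction = mean(p for p, _ in rows)
    avg_contrib = {k: mean(c.get(k, 0.0) for _, c in rows) for k in FEATURES}
    return avg_prediction, avg_contrib


pred_before, contrib_before = summarize(before)
pred_after, contrib_after = summarize(after)

shift = pred_after - pred_before
shift_by_feature = {k: contrib_after[k] - contrib_before[k] for k in FEATURES}

print(shift, shift_by_feature)
```

In this made-up case most of the change in the average prediction is attributed to RM, with a smaller part attributed to LSTAT.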

60 comments on “Interpreting random forests”

Sujit R. Tangadpalliwar on December 24, 2014 at 6:22 am said:

Thank you sir for such an informative description.

C F on December 30, 2014 at 2:54 pm said:

How did you create the great interactive visualization figure? Would you say your techniques are scalable to a large tree?

ando on January 2, 2015 at 9:38 am said:

I created it using D3 (http://d3js.org/), a great Javascript visualization library.
You can see a lot of examples of tree visualizations at https://github.com/mbostock/d3/wiki/Gallery

As for large trees, the number of nodes grows exponentially in the depth of the tree. So a tree of depth 10 can already have ~2000 nodes. A tree of this size will be very difficult for a human to read, since there is simply too much fine-grained information there. But that’s where the usefulness of the described approach comes in, since it is agnostic to the size of the tree (or the number of trees in case of a forest).

- sunny tomar on May 25, 2019 at 1:19 pm said:
  
  I am working on similar project , thanks for the wonderful explanation. Could you please share the code for designing the graph which highlights the path. Thanks in advance.
  
Matt on March 1, 2015 at 4:29 pm said:

I’m thinking this approach could also be adapted to gradient boosted trees, which are also (at least as I understand their implementation in SAS EM) an ensemble of a number of trees from bootstrapped samples of the data (but using all features vs. a sample of features) ? I’ve also seen examples of using trees to visualize neural nets.

ando on March 4, 2015 at 12:22 am said:

Yes, it would indeed also work for gradient boosted trees in a similar way. Basically, any time the prediction is made via trees, the prediction can be broken down into a sum of feature contributions.

- Soren Welling on June 8, 2016 at 11:00 am said:
  
  @Basically, any time the prediction is made via trees, the prediction can be broken down into a sum of feature contributions
  
  The definition of feature contributions should be modified for gradient boosting. The sum of decision paths (aka. local increments) should no longer be divided with number of trees, in order to maintain “prediction = bias + sum of feature contributions”. Each bagged tree maps from bias (aka. base rate +stratification or grand mean) to target and the ensemble prediction is the average vote and therefore division by number of trees. Each boosted tree only maps from residual to target, and the boosted ensemble maps only once from bias to target, therefore division by 1.
  
  I appended a short proof-of-concept for computing and visualizing feature contributions for gradient boosting with R in ancillary files for this paper, http://arxiv.org/abs/1605.09196
  
Bunyamin on March 9, 2015 at 8:29 pm said:

There is a typ0.

line 5 up from the last sentence. “anlyst” should be “analyst”.

Bobbie on April 23, 2015 at 11:04 pm said:

Thanks to this post, I understood the ‘theorical equation’ behind Random Forest running.
Do you have a source where the equation came?
Thanks Again for everything,

Bobbie

- ando on April 24, 2015 at 6:36 pm said:
  
  I have a fork of scikit-learn that implements calculating the decision paths for each prediction: https://github.com/andosa/scikit-learn/tree/tree_paths
  (decision_paths method in RandomForest). I’ll write a more detailed post on it once the pull request is merged back to sklearn.
  
  - Pedro on July 8, 2015 at 6:44 pm said:
    
    Hi Ando, any luck with this? I was wondering if we could maybe make a standalone module, should it not be merged.
    
    - ando on July 9, 2015 at 8:43 am said:
      
      In current 0.17dev, my commit to keep values in all nodes was merged. Additionally, a method to get the leaf labels when predicting was added. Combining these, the interpretation can be done on the 0.17dev version. Planning to write a blog post on this in the near future.
      
Zach on June 25, 2015 at 4:17 pm said:

This is great stuff Ando. I was thinking about how to apply this to ‘understand’ a whole dataset/model combination. You could, e.g., pick a few top features and cluster the entire population according to the feature contributions, for these features, from a RF model. On the Boston housing data, this leads to 8-10 clusters with clear descriptions like “Neighborhoods that are expensive because they are right near the city center”, or “Neighborhoods that are expensive because the houses are huge”. You could even then compare two data sets by fitting the clusters and seeing how the proportions change.

Thanks for the contribution – looking forward to seeing decision_paths in sklearn.

FS on May 24, 2016 at 4:15 pm said:

This is great! Do you know if this is available with the R random forest package?

Chad on July 29, 2016 at 2:48 pm said:

What would it be the interpretation of a negative value, for a specific variable, in a classification problem? Does it mean that higher values of this variable decrease the predicted probability? I.e. Given Predicted_prob(x) = Bias + 0.01*A – 0.02*B, is it correct to assume that probability to belong to class X is inversely proportional to the value assumed by B?

- ando on July 30, 2016 at 8:41 pm said:
  
  Remember, all of these breakdowns are exact contribution from features per datapoint/instance.
  So in your example, it means that for datapoint x, B reduces the probability. It doesn’t mean that B always (or on average) reduces the probability. For some other datapoint, B could be positive.
  
  - Shilpa on November 22, 2019 at 8:43 am said:
    
    If for some datapoints B could be positive for some it could be negative; how do we interpret the contribution. I was under the impression that we will learn more about the features and how do they contribute to the respective classes from this exercise but that does not seem to be the case!
    Can you please explain.
    Thanks much,
    
jayson pryde on January 11, 2017 at 6:15 pm said:

Great post! 🙂
Question though… Quoting this:

” For the decision tree, the contribution of each feature is not a single predetermined value, but depends on the rest of the feature vector which determines the decision path that traverses the tree and thus the guards/contributions that are passed along the way”

If in case I get the mean of the contributions of each feature for all the training data in my decision tree model, and then just use the linear regression f(x) = a + bx (where a is the mean bias and b is now the mean contributions) to do predictions for incoming data, do you think this will work?

Patrick on February 15, 2017 at 10:09 pm said:

Hi – I would like to use the figure above in an O’Reilly media article about interpretable machine learning. This article would feature treeinterpreter among many other techniques. Please let me know ASAP. Thanks!

- ando on February 16, 2017 at 5:04 pm said:
  
  It’s fine if it’s attributed correctly
  
  - Patrick on February 17, 2017 at 2:45 pm said:
    
    We will link to this blog. I followed you on twitter recently. Please let me know here or there if you would like any other specific citation.
    
Anonymous on May 22, 2017 at 12:07 pm said:

Thank You so much Sir!! It’s precisely as well as very well explained!!

vivek on August 3, 2017 at 11:05 am said:

can we get black box rules in random forest(code) so I can use that in my new dataset also?

Andreas on September 22, 2017 at 8:55 am said:

Thank you for this package, it is really great that it allows to open the random forest “blackbox”.

I have a quick question: I tested the package with some housing data to predict prices and I have a case where all the features are the same except the AreaLiving. Looking at the feature contributions however, they are different for all the features. I would have expected to get them the same, is that reasoning wrong?

For most cases the feature contributions are close together, but not the same. However, some are quite apart, like the rooms (- 96 vs. -44), even though they have the same number of rooms. Another case is the latitude (-452 vs -289).

Maybe the interpretation is: The small house with 5 rooms gets more subtracted (-96) than the big house (-44) as you expect these rooms to be smaller? And for the latitude the small house gets a more negative contribution (-452) than the big house (-289) as in this latitude you can better sell a big house?

Many thanks in advance for any help!

Best regards,
Andreas

- ando on September 23, 2017 at 9:11 am said:
  
  Think of it this way: depending on the value of the root node in the decision tree, you can end up in the left or right branch. The left and right branch can use completely different features. Thus, simply by changing the value of the feature that’s in the root node, you might see contributions shift completely.
  
  For example the root node might be location (say city vs countryside). In the first case, the important features might be number of rooms and tax zone. In the second case, important features might be land lot size and number of floors. So simply switching up one feature (location), you would get completely different contributions.
  
Andreas on September 25, 2017 at 6:35 am said:

Hello Ando,

Thank you for your reply. That makes sense. I figured out as well that I had included some features with low importance that often triggered some bigger changes, removing them should help the model to return more stable contributions.

Many thanks for your help!

julian on October 27, 2017 at 11:57 am said:

Hi, can you say something about how this applies to classification trees, as the examples you have given all relate to regression trees. Can you break down how the bias and contribution affect the chosen purity measure, say Gini index?

- ando on October 27, 2017 at 9:47 pm said:
  
  Not the purity measure but the actual predicted probability. See the example in http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/ “Classification trees and forests”
  
  - julian on November 8, 2017 at 11:27 am said:
    
    Thanks. I see the example. Can you tell me if this method can be applied to categorical/nominal features? I built an example but I realised that after encoding all my categories as integer, the model must be treating them as ordinal or continuous. I am trying to set this up with all the features one-hot encoded, to get around this but it’s then rather difficult to extract any meaning from the contributions. Is this something you’ve explored already?
    
    - ando on November 9, 2017 at 7:19 am said:
      
      One hot encoded features work as well if not better for interpretation
      
Anonymous on November 28, 2017 at 6:09 pm said:

Will there be an R version? Thanks.

Jean-Matthieu on January 30, 2018 at 4:41 pm said:

Hello Ando,

Are you aware of any research paper on this computation of “contribution by averaging decision paths over trees” ? All similar implementations in R or python I have found, trace back to this blog post.
Were you (in)directly inspired by some paper, or is it an original “contribution” from yourself?

Many thanks,

- ando on January 31, 2018 at 5:48 pm said:
  
  It is an independent/original contribution, however I later learned there is a paper on this method from around the same time I first used the method: https://pdfs.semanticscholar.org/28ff/2f3bf5403d7adc58f6aac542379806fa3233.pdf
  
  - Jean-Matthieu on February 12, 2018 at 3:11 pm said:
    
    Thanks for the link and congrats for this idea!
    
    I have seen a similar implementation in R (xgboostExplainer, on CRAN). The main difference is that contributions are expressed in log-odds of probability.
    I’m curious about your thoughts of using log-odds, which has the advantage to bring a “bayesian interpretation” of contributions. However, it seems that it is not possible to maintain all additivity properties [1] and [2] ([1] a contribution of feature F is equal to the mean of the contributions of feature F for all decision trees ; [2] the prediction score is equal to the sum of all feature contributions and equal to the mean of prediction score for all decision trees.).
    Any thoughts on using log-odds for contributions?
    
vivek on August 29, 2018 at 11:31 am said:

what is the meaning of mtry in random forest

anon on September 21, 2018 at 10:33 pm said:

From what I understand, in the binary classification case, if I get a contribution = [x,-x] of a feature it means that using this feature I gain x in probability to be of class0.
However this doesn’t give us any information of what the feature value is? Is there a way to extract what values of the feature are making it of class0?

Cookie Monster on November 25, 2019 at 5:55 pm said:

I don’t understand why do we need this concept of “contributions” here that makes random forests “white box”. e.g., a random forest with entropy loss itself does an optimization with respect to conditional uncertainty that provides a measure of contribution of the added features in its decision trees. The contribution defined here is an interesting concept. However, I believe it doesn’t add much understanding to the random forests and doesn’t make them “white box”.

jenkinsjob on December 9, 2019 at 2:51 pm said:

I am looking for Some Interpretable tool like LIME & ELI5, i tried this method to explain but not sure how to plot graph which says which feature contribute for model prediction, can you help me to get plot?
https://onclick360.com/cost-function-in-machine-learning/

klik disini on December 28, 2020 at 10:45 am said:

Hello, yes this post is really good and I have learned lot of things from it regarding blogging. thanks.

Nivir on April 5, 2021 at 8:32 am said:

Hi, how do you generate the tree diagram?
