Selecting good features – Part III: random forests

In my previous posts, I looked at univariate feature selection and linear models and regularization for feature selection.

In this post, I’ll discuss random forests, another popular approach for feature ranking.

Random forest feature importance

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

Mean decrease impurity

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification, it is typically either Gini impurity or information gain/entropy and for regression trees it is variance. Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.

This is the feature importance measure exposed in sklearn’s Random Forest implementations (random forest classifier and random forest regressor).

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np
#Load boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]
rf = RandomForestRegressor(), Y)
print "Features sorted by their score:"
print sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names), 

Features sorted by their score:
[(0.5298, 'LSTAT'), (0.4116, 'RM'), (0.0252, 'DIS'), (0.0172, 'CRIM'), (0.0065, 'NOX'), (0.0035, 'PTRATIO'), (0.0021, 'TAX'), (0.0017, 'AGE'), (0.0012, 'B'), (0.0008, 'INDUS'), (0.0004, 'RAD'), (0.0001, 'CHAS'), (0.0, 'ZN')]

There are a few things to keep in mind when using the impurity based ranking. Firstly, feature selection based on impurity reduction is biased towards preferring variables with more categories (see Bias in random forest variable importance measures). Secondly, when the dataset has two (or more) correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, it can lead to the incorrect conclusion that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable.

The effect of this phenomenon is somewhat reduced thanks to random selection of features at each node creation, but in general the effect is not removed completely. In the following example, we have three correlated variables \(X_0, X_1, X_2\), and no noise in the data, with the output variable simply being the sum of the three features:

size = 10000
X_seed = np.random.normal(0, 1, size)
X0 = X_seed + np.random.normal(0, .1, size)
X1 = X_seed + np.random.normal(0, .1, size)
X2 = X_seed + np.random.normal(0, .1, size)
X = np.array([X0, X1, X2]).T
Y = X0 + X1 + X2
rf = RandomForestRegressor(n_estimators=20, max_features=2), Y);
print "Scores for X0, X1, X2:", map(lambda x:round (x,3),

Scores for X0, X1, X2: [0.278, 0.66, 0.062]

When we compute the feature importances, we see that \(X_1\) is computed to have over 10x higher importance than \(X_2\), while their “true” importance is very similar. This happens despite the fact that the data is noiseless, we use 20 trees, random selection of features (at each split, only two of the three features are considered) and a sufficiently large dataset.

One thing to point out though is that the difficulty of interpreting the importance/ranking of correlated variables is not random forest specific, but applies to most model based feature selection methods.

Mean decrease accuracy

Another popular feature selection method is to directly measure the impact of each feature on accuracy of the model. The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant variables, the permutation should have little to no effect on model accuracy, while permuting important variables should significantly decrease it.

This method is not directly exposed in sklearn, but it is straightforward to implement it. Continuing from the previous example of ranking the features in the Boston housing dataset:

from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

X = boston["data"]
Y = boston["target"]

rf = RandomForestRegressor()
scores = defaultdict(list)

#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(len(X), 100, .3):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r =, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
print "Features sorted by their score:"
print sorted([(round(np.mean(score), 4), feat) for 
              feat, score in scores.items()], reverse=True)

Features sorted by their score:
[(0.7276, 'LSTAT'), (0.5675, 'RM'), (0.0867, 'DIS'), (0.0407, 'NOX'), (0.0351, 'CRIM'), (0.0233, 'PTRATIO'), (0.0168, 'TAX'), (0.0122, 'AGE'), (0.005, 'B'), (0.0048, 'INDUS'), (0.0043, 'RAD'), (0.0004, 'ZN'), (0.0001, 'CHAS')]

In this example LSTAT and RM are two features that strongly impact model performance: permuting them decreases model performance by ~73% and ~57% respectively. Keep in mind though that these measurements are made only after the model has been trained (and is depending) on all of these features. This doesn’t mean that if we train the model without one these feature, the model performance will drop by that amount, since other, correlated features can be used instead.


Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.

Next up: Stability selection, recursive feature elimination, and an example comparing all discussed methods side by side.

57 comments on “Selecting good features – Part III: random forests

  1. I think in the article under the second screenshot, he means to imply X2 and X1 not X3 and X2 (There is not X3 in the datatset provided)

  2. Pingback: Predicting Loan Default – Developing a fraud detection system | Niall Martin

  3. Great post. I have 2 questions about your second method – mean decrease accuracy:
    a. If you are doing a gridsearch, does the GridSearchCV() have to be performed before the for loop (i.e. before line 12)? Another way to word this question: should the for loop (lines 12-21) be run on the (i) regressor with tuned hyperparameters or (ii) default regressor (default hyperparameters)?
    b. Do you know if this method is still not exposed in scikit-learn?

    • If you use gridsearch to find the best model, then you should indeed run it before the feature selection.

      Afaik the methodis not exposed in scikit learn

      • Thanks! I typically believed that first one would select features and then tune the model based on those features. Do you know why the gridsearch should be run before selecting the features?

        • You typically use feature selection in Random Forest to gain a better understanding of data, in terms of gaining an insight which features have an impact on the response etc. So for this, you use a good model, obtained by gridserach for example.

          Going the other way (selecting features and the optimizing the model) isn’t wrong per se, just that in the RF setting it is not that useful, as RF already performs implicit feature selection, so you don’t need to pre-pick your features in general.

          • “…you don’t need to pre-pick your features in general.”
            The calculation of all features could be too time consuming.

  4. Could you explain these 3 lines:

    np.random.shuffle(X_t[:, i])
    shuff_acc = r2_score(Y_test, rf.predict(X_t))

    In the np.random() line, are you shuffling the feature rows (i.e. shuffling the order of the samples) – i.e. samples 10 and 5 would be swapped? If so, then on the very next line, r2_score(Y_test, rf.predict(X_t)), would you also need to shuffle the Y_test in the exact same way before calculating the r2_score()? I mean if the X_t is shuffled then would the samples be out of order compared to the Y_test?

      • It would indicate that the benefit of having the feature is negative. For example if the feature is pure noise, then shuffling it can just by chance increase its predictiveness ver slightly, resulting in the negative value. In practice it just mean’s it’s a useless feature

    • I only shuffle it for one feature, all other features stay as is. I simply want to see how well I can predict Y_test if that particular feature is shuffled

      • Hi,
        Thank you for such great article.
        Why don’t you just delete the column? Shuffle is random changes, but what if we have a particular variable x which could have only {0,1,2}, by shuffling this features columns we might not 100% remove feature impact.
        Thanks for your great blog.

  5. Thank you for your highly informative post! Regarding your Summary section, could you explain how we can check for correlated features? Do you mean calculate pearson’s correlation coefficient between each feature and the target column:

    for j in range(X.shape[0]):
    print np.corrcoef([X[:,j],Y])

    If so, does this also work for classification?

    Also, how can we find the number of categories for each feature? What is meant here by the term “categories”?

  6. In the second method, if you have two important but correlated features, if you permute one and leave the other Ok, wouldn’t the resulting prediction not be particularly affected, and thus the two important correlated features would show up as unimportant?

  7. Great post – thanks!

    Quick question: due to the reasons explained above, would the mean decrease accuracy be a better measure of variable importance or would it also be effected in the same way by the correlation bias?

  8. Pingback: Are categorical variables getting lost in your random forests?

  9. Pingback: Stepping away from linear interpolation

  10. Great post, thanks. What about if we’re populating the minority with, say, SMOTE, when dealing with imbalanced data sets? Which one do you think would be the correct approach: Applying the feature importance with or without adding the newly generated minority class’ examples to the data set?

  11. IMHO, your second approach is indeed very similar (if not exactly the same) to the internal calculation of variable importance of typical RF variable importance calculation (hence similar or the same to the first approach)
    Except maybe the typical RF variable importance calculation is performed (using training data ofc) only on the OOB samples for individual tree, and your second approach is basically using all the samples. This technique is formally known as Mean Decrease Accuracy or permutation importance:

    Best regards,

  12. Pingback: Classification and Regression – Min Liang's blog

  13. Pingback: Feature Engineering – Min Liang's blog

  14. Pingback: Week 6: Revisiting feature importances and effect of feature reduction on model performance | SURF 2017

  15. Pingback: Improving the Random Forest in Python Part 1 | Copy Paste Programmers

  16. Pingback: Data scientist’s toolbox - Pro Analytics Expert

  17. Pingback: Regression Coefficients as independent variables in second model – Nevin Manimala’s Blog

  18. I don’t think the data you simulated are correlated… sure they come from the same distribution with the same mean and standard deviation but to actually simulate correlated predictors wouldn’t you need to use a multivariate normal with a variance-covariance matrix containing the correlation coefficients on the off-diagnols?

    • I’m with you. Pulling from the same normal distribution doesn’t mean that the features would be correlated.

      Otherwise, great post.

      • I think you are misreading the code. I’m not pulling from the same distribution, i’m pulling noise from the same distribution. X0 to X2 are actually the same variable X_seed with some noise added, making them very strongly correlated with a corrcoef of 0.99.

  19. Hello, Can I use random forest solely for feature selection, irrespective of accuracy it gives on test data? I have a data-set of 960×206. I want to use random forest to pick up important variables here. My model has given 20% OOB(which is very high) and gave 61% accuracy on test data. Now that I have a low accuracy, can I confidently assume that the features I have selected are good ones?

  20. Pingback: Feature Selection algorithms - Best Practice - Dawid Kopczyk

  21. In, the problem of determining the best feature using Random forests, does it make sense to permute the feature order in the training data and then make an interpretation? For eg, X_0 and X_1 have same importance because they are correlated but the model identifies X_0 to be more important and thereafter the importance of X_1 decreases. But if we permute the order of the features and later X_1 appears to be more important, we can conclude that both have similar importance. Is there any such practice?

  22. Pingback: Идентификация на основе множества признаков — Моделирование и распознавание 2D/3D образов

  23. After I initially commented I seem to have clicked
    the -Notify me when new comments are added- checkbox and from now on whenever
    a comment is added I receive four emails with the
    same comment. Is there a means you are able to remove
    me from that service? Thanks!

  24. Pingback: From Decision Trees to Gradient Boosting - Dawid Kopczyk

  25. Pingback: Variable selection in Python, part I | MyCarta

  26. Hello! Thank you for this great and very useful article. I was trying to reproduce you code however I received an error: TypeError: ‘ShuffleSplit’ object is not iterable. Could you please suggest a solution?

    • It looks like the function has been updated.

      Try this:
      rs = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
      for train_idx, test_idx in rs.split(X):

  27. Regarding max_features=2.
    actually, it is not only 2 features. it is 2 features, if no split is found, then it takes max_features=n (3).
    from here:
    “max_features : int, float, string or None, optional (default=”auto”)
    The number of features to consider when looking for the best split:

    If int, then consider max_features features at each split.
    If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.
    If “auto”, then max_features=sqrt(n_features).
    If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
    If “log2”, then max_features=log2(n_features).
    If None, then max_features=n_features.
    Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.

    this means, that it doesn’t neccesarily use only 2 features. which makes the entire point of the max_features option a bit useless in my opinion

  28. Hello, Can I use random forest solely for feature selection, irrespective of accuracy it gives on test data? I have a data-set of 960×206. I want to use random forest to pick up important variables here. My model has given 20% OOB(which is very high) and gave 61% accuracy on test data. Now that I have a low accuracy, can I confidently assume that the features I have selected are good ones?

  29. Pingback: Kaggle Titanic Competition: построение моделей и тюнинг на Python

  30. Pingback: Using Data Science to Make Your Next Trip on Boston Airbnb – Data Science Austria

  31. Thank you for the post, very helpful. A question for the ‘Mean decrease accuracy’, L20:
    shouldn’t it be: shuff_acc = r2_score(Y_test, r.predict(X_t))? i.e., the model should be ‘r’ rather than ‘rf’?

  32. Pingback: LASSO or random forest (RF) to use for variable selection when having highly correlated features in a relatively small dataset with many features? – GrindSkills

  33. Pingback: Random forests feature selection [closed] – GrindSkills

  34. Pingback: Understanding Permutation Feature Importance: The default Random Forest Feature importance is not reliable

Leave a Reply

Your email address will not be published. Required fields are marked *