Selecting good features – Part IV: stability selection, RFE and everything side by side

In my previous posts, I looked at univariate methods, linear models and regularization, and random forests for feature selection.

In this post, I’ll look at two other methods: stability selection and recursive feature elimination (RFE), both of which can be considered wrapper methods. They build on top of other (model-based) selection methods such as regression or SVM, building models on different subsets of data and extracting the ranking from the aggregates.

As a wrap-up, I’ll run all the previously discussed methods side by side on a sample dataset, to highlight their pros, cons and gotchas with respect to each other.

Stability selection

Stability selection is a relatively novel method for feature selection, based on subsampling in combination with selection algorithms (which could be regression, SVMs or another similar method). The high-level idea is to apply a feature selection algorithm on different subsets of the data and with different subsets of the features. After repeating the process a number of times, the selection results can be aggregated, for example by checking how many times a feature ended up being selected as important when it was in an inspected feature subset. We can expect strong features to have scores close to 100%, since they are always selected when possible. Weaker but still relevant features will also have non-zero scores, since they would be selected when stronger features are not present in the currently selected subset, while irrelevant features would have scores (close to) zero, since they would never be among the selected features.
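
To make the idea more concrete, here is a minimal sketch of stability selection done by hand: fit a lasso on many random subsamples of the rows and record how often each feature ends up with a non-zero coefficient. Note that this sketch only subsamples the data, while the randomized lasso described below additionally perturbs the per-feature penalties; the subsample fraction, number of iterations and regularization strength below are arbitrary choices for illustration.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale

boston = load_boston()
X, Y, names = scale(boston["data"]), boston["target"], boston["feature_names"]

n_iter = 200           #number of subsamples to draw
sample_fraction = 0.75 #fraction of rows used in each subsample
selection_counts = np.zeros(X.shape[1])

rng = np.random.RandomState(42)
for _ in range(n_iter):
    #fit the selector on a random subsample of the rows
    idx = rng.choice(X.shape[0], int(sample_fraction * X.shape[0]), replace=False)
    lasso = Lasso(alpha=0.025).fit(X[idx], Y[idx])
    #count a feature as selected if its coefficient is non-zero
    selection_counts += np.abs(lasso.coef_) > 1e-6

scores = selection_counts / n_iter
print "Features sorted by their selection frequency:"
print sorted(zip(map(lambda x: round(x, 2), scores), names), reverse=True)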

Sklearn implements stability selection in the randomized lasso and randomized logistic regression classes.

from sklearn.linear_model import RandomizedLasso
from sklearn.datasets import load_boston
boston = load_boston()

#using the Boston housing data. 
#Data gets scaled automatically by sklearn's implementation
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)

print "Features sorted by their score:"
print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 names), reverse=True)


Features sorted by their score:
[(1.0, 'RM'), (1.0, 'PTRATIO'), (1.0, 'LSTAT'), (0.62, 'CHAS'), (0.595, 'B'), (0.39, 'TAX'), (0.385, 'CRIM'), (0.25, 'DIS'), (0.22, 'NOX'), (0.125, 'INDUS'), (0.045, 'ZN'), (0.02, 'RAD'), (0.015, 'AGE')]

As you can see from the example, the top three features have equal scores of 1.0, meaning they were always selected as useful features (of course this could and would change when changing the regularization parameter, but sklearn’s randomized lasso implementation can choose a good \(\alpha\) parameter automatically). The scores drop smoothly from there, but in general the drop-off is not as sharp as is often the case with pure lasso or random forests. This means stability selection is useful both for pure feature selection to reduce overfitting and for data interpretation: in general, good features won’t get scores of zero just because there are similar, correlated features in the dataset (as happens with lasso coefficients). For feature selection, I’ve found it to be among the top performing methods for many different datasets and settings.
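
For contrast, here is what a plain lasso with the same regularization strength gives on the same data (scaled manually here, since plain Lasso doesn’t do it for us); this is just a quick sketch to show how, unlike with the stability scores above, some correlated features can end up with coefficients at or very close to zero. The alpha value is carried over from the example above and is not tuned.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import scale

#plain lasso on the scaled Boston data, same alpha as above
lasso = Lasso(alpha=0.025)
lasso.fit(scale(X), Y)

print "Lasso coefficients (absolute values):"
print sorted(zip(map(lambda x: round(x, 4), np.abs(lasso.coef_)), names),
             reverse=True)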

Recursive feature elimination

Recursive feature elimination is based on the idea of repeatedly constructing a model (for example an SVM or a regression model), choosing either the best or worst performing feature (for example based on its coefficient), setting that feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.
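
As a rough sketch of the mechanism (not sklearn’s implementation), here is a naive elimination loop that uses the absolute values of linear regression coefficients as the per-iteration ranking criterion; naive_rfe is a hypothetical helper for illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

def naive_rfe(X, Y, names):
    #repeatedly fit a model and drop the weakest feature;
    #features eliminated last are the most important
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        model = LinearRegression().fit(X[:, remaining], Y)
        #the feature with the smallest absolute coefficient is dropped first
        weakest = remaining[np.argmin(np.abs(model.coef_))]
        remaining.remove(weakest)
        eliminated.append(names[weakest])
    #reverse, so the best (last eliminated) feature comes first
    return eliminated[::-1]

print naive_rfe(X, Y, names)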

The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. Just as non-regularized regression can be unstable, so can RFE built on top of it, while using ridge regression as the base model can provide more stable results.

Sklearn provides RFE for recursive feature elimination and RFECV for finding the ranks together with the optimal number of features via a cross-validation loop.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

#use linear regression as the model
lr = LinearRegression()
#rank all features, i.e. continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X,Y)

print "Features sorted by their rank:"
print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))


Features sorted by their rank:
[(1.0, 'NOX'), (2.0, 'RM'), (3.0, 'CHAS'), (4.0, 'PTRATIO'), (5.0, 'DIS'), (6.0, 'LSTAT'), (7.0, 'RAD'), (8.0, 'CRIM'), (9.0, 'INDUS'), (10.0, 'ZN'), (11.0, 'TAX'), (12.0, 'B'), (13.0, 'AGE')]
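
The RFECV variant mentioned above additionally uses cross-validation to decide how many features to keep; a minimal sketch (the number of folds is an arbitrary choice, and the estimator’s default scoring is used):

from sklearn.feature_selection import RFECV

#let cross-validation pick the number of features to keep
rfecv = RFECV(LinearRegression(), cv=5)
rfecv.fit(X, Y)

print "Optimal number of features: %d" % rfecv.n_features_
print sorted(zip(map(float, rfecv.ranking_), names))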

Example: running the methods side by side

I’ll now take all the examples from this post and the three previous ones, and run the methods on a sample dataset to compare them side by side. The dataset will be the so-called Friedman #1 regression dataset (from Friedman’s Multivariate Adaptive Regression Splines paper). The data is generated according to the formula \(y = 10 \sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon\), where the variables \(x_1\) to \(x_5\) are drawn from a uniform distribution and \(\epsilon\) is standard normal noise \(N(0,1)\). Additionally, the original dataset has five noise variables \(x_6,…,x_{10}\), independent of the response variable. We will increase the number of variables further and add four variables \(x_{11},…,x_{14}\), each of which is very strongly correlated with \(x_1,…,x_4\), respectively, generated by \(f(x) = x + N(0, 0.025^2)\), i.e. by adding Gaussian noise with a standard deviation of 0.025. This yields a correlation coefficient above 0.99 between each variable and its noisy copy. This will illustrate how different feature ranking methods deal with correlations in the data.

We’ll run each of the above listed methods on the dataset and normalize the scores so that they are between 0 (for the lowest ranked feature) and 1 (for the highest ranked feature). For recursive feature elimination, the top five features will all get score 1, with the ranks of the rest spaced equally between 0 and 1 according to their rank.

from sklearn.datasets import load_boston
from sklearn.linear_model import (LinearRegression, Ridge, 
                                  Lasso, RandomizedLasso)
from sklearn.feature_selection import RFE, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from minepy import MINE

np.random.seed(0)

size = 750
X = np.random.uniform(0, 1, (size, 14))

#"Friedamn #1” regression problem
Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 + 
     10*X[:,3] + 5*X[:,4] + np.random.normal(0,1))
#Add 4 additional correlated variables (correlated with X1-X4)
X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4))

names = ["x%s" % i for i in range(1,15)]

ranks = {}

def rank_to_dict(ranks, names, order=1):
    #scale the scores to the [0, 1] range; order=-1 flips the direction
    #(used for RFE, where a lower rank means a better feature)
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks))

lr = LinearRegression(normalize=True)
lr.fit(X, Y)
ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)

ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)


lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)


rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)

#stop the search when 5 features are left (they will get equal scores)
rfe = RFE(lr, n_features_to_select=5)
rfe.fit(X,Y)
ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1)

rf = RandomForestRegressor()
rf.fit(X,Y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, names)


f, pval  = f_regression(X, Y, center=True)
ranks["Corr."] = rank_to_dict(f, names)

mine = MINE()
mic_scores = []
for i in range(X.shape[1]):
    mine.compute_score(X[:,i], Y)
    m = mine.mic()
    mic_scores.append(m)

ranks["MIC"] = rank_to_dict(mic_scores, names) 


r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)

methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print "\t%s" % "\t".join(methods)
for name in names:
    print "%s\t%s" % (name, "\t".join(map(str, 
                         [ranks[method][name] for method in methods])))

Here’s the resulting table, with the results from each method plus the mean:
Feature  Lin. corr.  Linear reg.  Lasso  MIC   RF    RFE   Ridge  Stability  Mean
x1       0.3         1.0          0.79   0.39  0.18  1.0   0.77   0.61       0.63
x2       0.44        0.56         0.83   0.61  0.24  1.0   0.75   0.7        0.64
x3       0.0         0.5          0.0    0.34  0.01  1.0   0.05   0.0        0.24
x4       1.0         0.57         1.0    1.0   0.45  1.0   1.0    1.0        0.88
x5       0.1         0.27         0.51   0.2   0.04  0.78  0.88   0.6        0.42
x6       0.0         0.02         0.0    0.0   0.0   0.44  0.05   0.0        0.06
x7       0.01        0.0          0.0    0.07  0.0   0.0   0.01   0.0        0.01
x8       0.02        0.03         0.0    0.05  0.0   0.56  0.09   0.0        0.09
x9       0.01        0.0          0.0    0.09  0.0   0.11  0.0    0.0        0.03
x10      0.0         0.01         0.0    0.04  0.0   0.33  0.01   0.0        0.05
x11      0.29        0.6          0.0    0.43  0.14  1.0   0.59   0.39       0.43
x12      0.44        0.14         0.0    0.71  0.12  0.67  0.68   0.42       0.4
x13      0.0         0.48         0.0    0.23  0.01  0.89  0.02   0.0        0.2
x14      0.99        0.0          0.16   1.0   1.0   0.22  0.95   0.53       0.61

The example should highlight some of the interesting characteristics of the different methods.

With linear correlation (Lin. corr.), each feature is evaluated independently, so the scores for features \(x_1…x_4\) are very similar to those of \(x_{11}…x_{14}\), while the noise features \(x_6…x_{10}\) are correctly identified to have almost no relation with the response variable. It’s not able to identify any relationship between \(x_3\) and the response variable, since the relationship is quadratic (in fact, this applies to almost all the other methods except for MIC). It’s also clear that while the method is able to measure the linear relationship between each feature and the response variable, it is not optimal for selecting the top performing features for improving the generalization of a model, since all top performing features would essentially be picked twice.

Lasso picks out the top performing features, while forcing other features to be close to zero. It is clearly useful when reducing the number of features is required, but not necessarily for data interpretation (since it might lead one to believe that features \(x_{11}…x_{13}\) do not have a strong relationship with the output variable).

MIC is similar to the correlation coefficient in treating all features “equally”; additionally, it is able to find the non-linear relationship between \(x_3\) and the response.

Random forest’s impurity based ranking is typically aggressive in the sense that there is a sharp drop-off of scores after the first few top ones. This can be seen from the example, where the third-ranked feature already has a score 4x smaller than the top feature (whereas for the other ranking methods, the drop-off is clearly not as aggressive).

Ridge regression forces regression coefficients to be spread out similarly between correlated variables. This is clearly visible in the example, where \(x_{11}…x_{14}\) are close to \(x_1…x_4\) in terms of scores.

Stability selection is often able to make a useful compromise between data interpretation and top feature selection for model improvement. This is illustrated well in the example: just like Lasso, it is able to identify the top features (\(x_1\), \(x_2\), \(x_4\), \(x_5\)). At the same time, their correlated shadow variables also get a high score, illustrating their relation with the response.

Conclusions

Feature ranking can be incredibly useful in a number of machine learning and data mining scenarios. The key though is to have the end goal clearly in mind and to understand which method works best for achieving it. When selecting top features for model performance improvement, it is easy to verify whether a particular method works well against alternatives simply by doing cross-validation. It’s not as straightforward when using feature ranking for data interpretation, where the stability of the ranking method is crucial and a method that doesn’t have this property (such as lasso) could easily lead to incorrect conclusions. What can help there is subsampling the data and running the selection algorithms on the subsets. If the results are consistent across the subsets, it is relatively safe to trust the stability of the method on this particular data and therefore straightforward to interpret the data in terms of the ranking.
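
As a minimal sketch of such a check (ranking_stability is a hypothetical helper; the number of subsamples, the subsample fraction and the choice of random forest as the ranker are arbitrary): run the same ranking method on several random subsamples and compare the resulting rankings, for example with a rank correlation.

import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

def ranking_stability(X, Y, n_runs=5, fraction=0.75, seed=0):
    #fit the same ranker on random subsamples and return the average
    #pairwise Spearman correlation between the resulting rankings
    rng = np.random.RandomState(seed)
    importances = []
    for i in range(n_runs):
        idx = rng.choice(X.shape[0], int(fraction * X.shape[0]), replace=False)
        rf = RandomForestRegressor(n_estimators=100, random_state=i)
        rf.fit(X[idx], Y[idx])
        importances.append(rf.feature_importances_)
    corrs = [spearmanr(importances[i], importances[j])[0]
             for i in range(n_runs) for j in range(i + 1, n_runs)]
    return np.mean(corrs)

print "Average rank correlation across subsamples: %.2f" % ranking_stability(X, Y)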

Comments

  1. Great series on feature selection! It’s direct and intuitive, unlike many of the ML texts which skirt around the topic but never address it directly. The code examples are especially useful. I’ve mostly been using linear models and random forests for feature selection; I’m glad to learn about stability selection and the others.

  2. Most of the examples I have are equally applicable to classification.
    For example, in sklearn you can use RandomForestClassifier instead of RandomForestRegressor, LogisticRegression (it includes an l1 penalty option) instead of Lasso, etc.

  3. Thanks for your great article, very impressive. I’d like to translate these posts into Chinese so that more people can learn from them, if you agree.

    • And how did you select the methods’ parameters? For different data, some methods could give less informative results, giving too many top ranks (or, vice versa, too many zeroes).
      To compensate for this, you can calculate the mean as a weighted sum, with weights based on each method’s average rank.

  4. Great series of posts!

    I think that the reason why most methods give x3 a low score isn’t that it is quadratically related to the dependent variable, but rather the specific form of the term and the domain of X. You should see this by changing the interval to 0.5 to 1.5 (or removing the 0.5 from that term). In particular, Random Forest is a non-linear algorithm, so it should see the relation between x3 and Y regardless of the square.

    Thanks!


  5. Awesome series of posts!
    I’ve never learned so much from any other post on the Internet.

    Respect! Thanks a lot!

  6. Hi, one question to ask: when I am doing feature engineering, should I perform feature selection first, or model selection first? Thanks.

    • Depends on the relationship. So it’s best to plot/understand the data first.
      An option is to generate a lot of different transformations (log, square, sqrt) and then apply lasso to see which (transformed) features come out on top.


  7. Hi,

    Congrats for the very clear and useful tutorial!
    I have a question: why did you normalize the data only for Linear Regression (normalize=True) and not for the other methods, including Lasso and RandomizedLasso, which also have this parameter option?

    Thank you in advance!
