In my previous posts, I looked at univariate methods,linear models and regularization and random forests for feature selection.

In this post, I’ll look at two other methods: stability selection and recursive feature elimination (RFE), which can both considered wrapper methods. They both build on top of other (model based) selection methods such as regression or SVM, building models on different subsets of data and extracting the ranking from the aggregates.

As a wrap-up I’ll run all previously discussed methods, to highlight their pros, cons and gotchas with respect to each other.

# Stability selection

Stability selection is a relatively novel method for feature selection, based on subsampling in combination with selection algorithms (which could be regression, SVMs or other similar method). The high level idea is to apply a feature selection algorithm on different subsets of data and with different subsets of features. After repeating the process a number of times, the selection results can be aggregated, for example by checking how many times a feature ended up being selected as important when it was in an inspected feature subset. We can expect strong features to have scores close to 100%, since they are always selected when possible. Weaker, but still relevant features will also have non-zero scores, since they would be selected when stronger features are not present in the currently selected subset, while irrelevant features would have scores (close to) zero, since they would never be among selected features.

Sklearn implements stability selection in the randomized lasso and randomized logistics regression classes.

from sklearn.linear_model import RandomizedLasso from sklearn.datasets import load_boston boston = load_boston() #using the Boston housing data. #Data gets scaled automatically by sklearn's implementation X = boston["data"] Y = boston["target"] names = boston["feature_names"] rlasso = RandomizedLasso(alpha=0.025) rlasso.fit(X, Y) print "Features sorted by their score:" print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), names), reverse=True)

Features sorted by their score:

[(1.0, 'RM'), (1.0, 'PTRATIO'), (1.0, 'LSTAT'), (0.62, 'CHAS'), (0.595, 'B'), (0.39, 'TAX'), (0.385, 'CRIM'), (0.25, 'DIS'), (0.22, 'NOX'), (0.125, 'INDUS'), (0.045, 'ZN'), (0.02, 'RAD'), (0.015, 'AGE')]

As you can see from the example, the top 3 features have equal scores of 1.0, meaning they were always selected as useful features (of course this could and would change when changing the regularization parameter, but sklearn’s randomized lasso implementation can choose a good \(\alpha\) parameter automatically). The scores drop smoothly from there, but in general, the drop off is not sharp as is often the case with pure lasso, or random forest. This means stability selection is useful for both pure feature selection to reduce overfitting, but also for data interpretation: in general, good features won’t get 0 as coefficients just because there are similar, correlated features in the dataset (as is the case with lasso). For feature selection, I’ve found it to be among the top performing methods for many different datasets and settings.

# Recursive feature elimination

Recursive feature elimination is based on the idea to repeatedly construct a model (for example an SVM or a regression model) and choose either the best or worst performing feature (for example based on coefficients), setting the feature aside and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.

The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. Just as non-regularized regression can be unstable, so can RFE when utilizing it, while using ridge regression can provide more stable results.

Sklearn provides RFE for recursive feature elimination and RFECV for finding the ranks together with optimal number of features via a cross validation loop.

from sklearn.feature_selection import RFE from sklearn.linear_model import LinearRegression boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] #use linear regression as the model lr = LinearRegression() #rank all features, i.e continue the elimination until the last one rfe = RFE(lr, n_features_to_select=1) rfe.fit(X,Y) print "Features sorted by their rank:" print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))

Features sorted by their rank:

[(1.0, 'NOX'), (2.0, 'RM'), (3.0, 'CHAS'), (4.0, 'PTRATIO'), (5.0, 'DIS'), (6.0, 'LSTAT'), (7.0, 'RAD'), (8.0, 'CRIM'), (9.0, 'INDUS'), (10.0, 'ZN'), (11.0, 'TAX'), (12.0, 'B'), (13.0, 'AGE')]

# Example: running the methods side by side

I’ll now take all the examples from this post, and the three previous ones and run the methods on a sample dataset to compare them side by side. The dataset will be the so called Friedman #1 regression dataset (from Friedman’s Multivariate Adaptive Regression Splines paper). The data is generated according to formula \(y = 10sin(\pi x_1 x_2) + 20(x_3 – 0.5)^2 + 10X_4 + 5X_5 +\epsilon\), where the \(x_1\) to \(x_5\) are drawn from uniform distribution and \(\epsilon\) is the standard normal deviate \(N(0,1)\). Additionally, the original dataset had five noise variables \(x_6,…,x_{10}\), independent of the response variable. We will increase the number of variables further and add four variables \(x_{11},…,x_{14}\) each of which are very strongly correlated with \(x_1,…,x_4\), respectively, generated by \(f(x) = x + N(0, 0.01)\). This yields a correlation coefficient of more than 0.999 between the variables. This will illustrate how different feature ranking methods deal with correlations in the data.

We’ll apply run each of the above listed methods on the dataset and normalize the scores so that that are between 0 (for lowest ranked feature) and 1 (for the highest feature). For recursive feature elimination, the top five feature will all get score 1, with the rest of the ranks spaced equally between 0 and 1 according to their rank.

from sklearn.datasets import load_boston from sklearn.linear_model import (LinearRegression, Ridge, Lasso, RandomizedLasso) from sklearn.feature_selection import RFE, f_regression from sklearn.preprocessing import MinMaxScaler from sklearn.ensemble import RandomForestRegressor import numpy as np from minepy import MINE np.random.seed(0) size = 750 X = np.random.uniform(0, 1, (size, 14)) #"Friedamn #1” regression problem Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 + 10*X[:,3] + 5*X[:,4] + np.random.normal(0,1)) #Add 3 additional correlated variables (correlated with X1-X3) X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4)) names = ["x%s" % i for i in range(1,15)] ranks = {} def rank_to_dict(ranks, names, order=1): minmax = MinMaxScaler() ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0] ranks = map(lambda x: round(x, 2), ranks) return dict(zip(names, ranks )) lr = LinearRegression(normalize=True) lr.fit(X, Y) ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names) ridge = Ridge(alpha=7) ridge.fit(X, Y) ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names) lasso = Lasso(alpha=.05) lasso.fit(X, Y) ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names) rlasso = RandomizedLasso(alpha=0.04) rlasso.fit(X, Y) ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names) #stop the search when 5 features are left (they will get equal scores) rfe = RFE(lr, n_features_to_select=5) rfe.fit(X,Y) ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1) rf = RandomForestRegressor() rf.fit(X,Y) ranks["RF"] = rank_to_dict(rf.feature_importances_, names) f, pval = f_regression(X, Y, center=True) ranks["Corr."] = rank_to_dict(f, names) mine = MINE() mic_scores = [] for i in range(X.shape[1]): mine.compute_score(X[:,i], Y) m = mine.mic() mic_scores.append(m) ranks["MIC"] = rank_to_dict(mic_scores, names) r = {} for name in names: r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2) methods = sorted(ranks.keys()) ranks["Mean"] = r methods.append("Mean") print "\t%s" % "\t".join(methods) for name in names: print "%s\t%s" % (name, "\t".join(map(str, [ranks[method][name] for method in methods])))

Here’s the resulting table (sortable by clicking on the column header), with the results from each method + the mean

Feature | Lin. corr. | Linear reg. | Lasso | MIC | RF | RFE | Ridge | Stability | Mean |
---|---|---|---|---|---|---|---|---|---|

x1 | 0.3 | 1.0 | 0.79 | 0.39 | 0.18 | 1.0 | 0.77 | 0.61 | 0.63 |

x2 | 0.44 | 0.56 | 0.83 | 0.61 | 0.24 | 1.0 | 0.75 | 0.7 | 0.64 |

x3 | 0.0 | 0.5 | 0.0 | 0.34 | 0.01 | 1.0 | 0.05 | 0.0 | 0.24 |

x4 | 1.0 | 0.57 | 1.0 | 1.0 | 0.45 | 1.0 | 1.0 | 1.0 | 0.88 |

x5 | 0.1 | 0.27 | 0.51 | 0.2 | 0.04 | 0.78 | 0.88 | 0.6 | 0.42 |

x6 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.44 | 0.05 | 0.0 | 0.06 |

x7 | 0.01 | 0.0 | 0.0 | 0.07 | 0.0 | 0.0 | 0.01 | 0.0 | 0.01 |

x8 | 0.02 | 0.03 | 0.0 | 0.05 | 0.0 | 0.56 | 0.09 | 0.0 | 0.09 |

x9 | 0.01 | 0.0 | 0.0 | 0.09 | 0.0 | 0.11 | 0.0 | 0.0 | 0.03 |

x10 | 0.0 | 0.01 | 0.0 | 0.04 | 0.0 | 0.33 | 0.01 | 0.0 | 0.05 |

x11 | 0.29 | 0.6 | 0.0 | 0.43 | 0.14 | 1.0 | 0.59 | 0.39 | 0.43 |

x12 | 0.44 | 0.14 | 0.0 | 0.71 | 0.12 | 0.67 | 0.68 | 0.42 | 0.4 |

x13 | 0.0 | 0.48 | 0.0 | 0.23 | 0.01 | 0.89 | 0.02 | 0.0 | 0.2 |

x14 | 0.99 | 0.0 | 0.16 | 1.0 | 1.0 | 0.22 | 0.95 | 0.53 | 0.61 |

The example should highlight some the interesting characteristics of the different methods.

With **linear correlation** (Lin. corr.), each feature is evaluated independently, so the scores for features \(x_1…x_4\) are very similar to \(x_{11}…x_{14}\), while the noise features \(x_5…x_{10}\) are correctly identified to have almost no relation with the response variable. It’s not able to identify any relationship between \(x_3\) and the response variable, since the relationship is quadratic (in fact, this applies almost all other methods except for MIC). It’s also clear that while the method is able to measure the linear relationship between each feature and the response variable, it is not optimal for selecting the top performing features for improving the generalization of a model, since all top performing features would essentially be picked twice.

**Lasso** picks out the top performing features, while forcing other features to be close to zero. It is clearly useful when reducing the number of features is required, but not necessarily for data interpretation (since it might lead one to believe that features \(x_{11}…x_{13}\) do not have a strong relationship with the output variable).

**MIC** is similar to correlation coefficient in treating all features “equally”, additionally it is able to find the non-linear a relationship between \(x_3\) and the response.

**Random forest’s** impurity based ranking is typically aggressive in the sense that there is a sharp drop-off of scores after the first few top ones. This can be seen from the example where the third ranked feature has already 4x smaller score than the top feature (whereas for the other ranking methods, the drop-off is clearly not that aggressive).

**Ridge regression** forces regressions coefficients to spread out similarly between correlated variables. This is clearly visible in the example where \(x_{11}…x_{14}\) are close to \(x_1…x_4\) in terms of scores.

**Stability selection** is often able to make a useful compromise between data interpretation and top feature selection for model improvement. This is illustrated well in the example. Just like Lasso it is able to identify the top features (\(x_1\), \(x_2\), \(x_4\), \(x_5\)). At the same time their correlated shadow variables also get a high score, illustrating their relation with the response.

# Conclusions

Feature ranking can be incredibly useful in a number of machine learning and data mining scenarios. The key though is to have the end goal clearly in mind and understand which method works best for achieving it. When selecting top features for model performance improvement, it is easy to verify if a particular method works well against alternatives simply by doing cross-validation. It’s not as straightforward when using feature ranking for data interpretation, where stability of the ranking method is crucial and a method that doesn’t have this property (such as lasso) could easily lead to incorrect conclusions. What can help there is subsampling the data and running the selection algorithms on the subsets. If the results are consistent across the subsets, it is relatively safe to trust the stability of the method on this particular data and therefor straightforward to interpret the data in terms of the ranking.

Great series on feature selection! It’s direct and intuitive unlike many of the ML texts which skirt around the topic but never address it directly. The code examples are especially useful. I’ve been using mostly using linear models and random forests for feature selection, I’m glad to learn about stability selection and the others.

Really great and helpful. Could you also add a similar examples for feature selection used for classification?

Most of the examples I have are equally applicable for classification.

For example in sklearn you can use RandomForestClassifier instead of RandomForestRegressor, LogisticRegression (it includes l1 penalty option) instead of Lasso etc.

The most comprehensive article about feature selection that I have read …….Thank you !

Learn a lot! Thank you!

Thanks for you great article, very impressive. I ‘d like to translate them into Chinese version so that more people in the world can learn these articles if you agree.

Sure, go ahead. And post the link in the comments if you do, I can then add it to the article.

You have put all your expertise in creating this article. Very very helpful in many ways.

Thank you. Extremely useful article. Stability method is brilliant.

And how did you selected methods parameters? For different data some methods could give less informative results, giving too much first ranks (or vice verse, too much zeroes).

To compensate this, you can calculate mean as weightened sum with weights as average method rank.

Thank you so much for sharing, the series posts are great explanations to related feature selection algorithms! Thank you!

Great series of posts!

I think that the reason why most methods give x3 a low score isn’t that is quadratically related to the dependent variable, but the specific form of the term and the domain for X. You should see this by changing the interval to 0.5 to 1.5 (or removing the 0.5 of that term). In particular, Random Forest is a non-linear algorithm, so it should see the relation between x3 and Y regardless of the square.

Thanks!

Thanks!

Pingback: selection fea | Baruch Amoussou's blog

Awesome serie of posts!

I’ve never learned that much thing in any other post on the Internet.

Respect! Thanks a lot!

Such an awesome post. Love it, thanks!

It’s helpful, thanks a lot.

Hi, one question to ask. When i am doing feature engineering, should i perform feature selection first? Or model selection first? Thanks.

How should one deal with the potential for non linear relationships. Is taking logs an option.

Depends on the relationship. So it’s best to plot/understand the data first.

An option is to generate a lot of different transformations (log, square, sqrt) and the apply lasso to see which (transformed) features come out on top.