Having a good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A good grasp of these methods leads to better performing models, better understanding of the underlying structure and characteristics of the data and leads to better intuition about the algorithms that underlie many machine learning models.

There are in general two reasons why feature selection is used:

1. Reducing the number of features, to reduce overfitting and improve the generalization of models.

2. To gain a better understanding of the features and their relationship to the response variables.

These two goals are often at odds with each other and thus require different approaches: depending on the data at hand a feature selection method that is good for goal (1) isn’t necessarily good for goal (2) and vice versa. What seems to happen often though is that people use their favourite method (or whatever is most conveniently accessible from their tool of choice) indiscriminately, especially methods more suitable for (1) for achieving (2).

At the same time, feature selection is not particularly thoroughly covered in machine learning or data mining textbooks, partly because they are often looked at as natural side effects of learning algorithms that don’t require separate coverage.

In this blog post I’ll try to cover some of the more popular approaches for feature selection and their pros, cons and gotchas along with code samples in Python and scikit-learn. I’ll also run the methods side-by-side on a sample dataset, which should highlight some of the major differences between them. In all my examples, I concentrate on regression datasets, but most of the discussion and examples are equally applicable for classification datasets and methods.

# Univariate feature selection

Univariate feature selection examines each feature individually to determine the strength of the relationship of the feature with the response variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of data (but not necessarily for optimizing the feature set for better generalization). There are lot of different options for univariate selection.

## Pearson Correlation

One of the simplest method for understanding a feature’s relation to the response variable is Pearson correlation coefficient, which measures linear correlation between two variables. The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables.

It’s fast and easy to calculate and is often the first thing to be run on the data. Scipy‘s pearsonr method computes both the correlation and p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.

import numpy as np from scipy.stats import pearsonr np.random.seed(0) size = 300 x = np.random.normal(0, 1, size) print "Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)) print "Higher noise", pearsonr(x, x + np.random.normal(0, 10, size))

`Lower noise (0.71824836862138386, 7.3240173129992273e-49)`

Higher noise (0.057964292079338148, 0.31700993885324746)

As you can see from the example, we compare a variable with a noisy version of itself. With smaller amount of noise, the correlation is relatively strong, with a very low p-value, while for the noisy comparison, the correlation is very small and furthermore, the p-value is high meaning that it is very likely to observe such correlation on a dataset of this size purely by chance.

Scikit-learn provides f_regrssion method for computing the p-values (and underlying F-values) in batch for a set of features, so it is a convenient way to run a corre1ation test on a dataset in one go and for example include it in a sklearn’s pipeline.

One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables.

For example, correlation between \(x\) and \(x^2\) is zero, when \(x\) is centered on 0.

x = np.random.uniform(-1, 1, 100000) print pearsonr(x, x**2)[0]

-0.00230804707612

For more examples similar to the above, see the following sample plots.

Furthermore as illustrated by Anscombe’s quartet, relying only on the correlation value on interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data.

## Mutual information and maximal information coefficient (MIC)

Another, more robust option for correlation estimation is mutual information, defined as

\(I(X,Y) = \sum_{y \in Y} \sum_{x \in X}

p(x,y) \log{ \left(\frac{p(x,y)}{p(x)\,p(y)}

\right) }, \,\!

\)

which measures mutual dependence between variables, typically in bits.

It can be inconvenient to use directly for feature ranking for two reasons though. Firstly, it is not a metric and not normalized (i.e. doesn’t lie in a fixed range), so the MI values can be incomparable between two datasets. Secondly, it can be inconvenient to compute for continuous variables: in general the variables need to be discretized by binning, but the mutual information score can be quite sensitive to bin selection.

Maximal information coefficient is a technique developed to address these shortcomings. It searches for optimal binning and turns mutual information score into a metric that lies in range [0;1]. In python, MIC is available in the minepy library.

Looking back at the \(y = x^2\) example, MIC finds that the mutual information is 1, i.e. maximal.

from minepy import MINE m = MINE() x = np.random.uniform(-1, 1, 10000) m.compute_score(x, x**2) print m.mic()

1.0

There has been some critique about MIC’s statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness.

## Distance correlation

Another robust, competing method of correlation estimation is distance correlation, designed explicitly to address the shortcomings of Pearson correlation. While for Pearson correlation, the correlation value 0 does not imply independence (as we saw from the \(x\) vs \(x^2\) example), distance correlation of 0 does imply that there is no dependence between the two variables.

Distance correlation is available for example in R’s energy package (and there’s also a Python gist).

#R-code > x = runif (1000, -1, 1) > dcor(x, x**2) [1] 0.4943864

There are at least two reasons why to prefer Pearson correlation over more sophisticated methods such as MIC or distance correlation when the relationship is close to linear. For one, computing the Pearson correlation is much faster, which may be important in case of big datasets. Secondly, the range of the correlation coefficient is [-1;1] (instead of [0;1] for MIC and distance correlation). This can relay useful extra information on whether the relationship is negative or positive, i.e. do higher feature values imply higher values of the response variables or vice versa. But of course the question of negative versus positive correlation is only well-posed if the relationship between the two variables is monotonic.

## Model based ranking

Finally one can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model. In fact, this is already put to use with Pearson’s correlation coefficient, since it is equivalent to standardized regression coefficient that is used for prediction in linear regression. If the relationship between a feature and the response variable is non-linear, there are a number of alternatives, for example tree based methods (decision trees, random forest), linear model with basis expansion etc. Tree based methods are probably among the easiest to apply, since they can model non-linear relations well and don’t require much tuning. The main thing to avoid is overfitting, so the depth of tree(s) should be relatively small, and cross-validation should be applied.

Here’s univariate selection using sklearn’s random forest regressor on Boston housing price data set, a sample which includes housing prices in suburbs of Boston together with a number of key attributes.

from sklearn.cross_validation import cross_val_score, ShuffleSplit from sklearn.datasets import load_boston from sklearn.ensemble import RandomForestRegressor #Load boston housing dataset as an example boston = load_boston() X = boston["data"] Y = boston["target"] names = boston["feature_names"] rf = RandomForestRegressor(n_estimators=20, max_depth=4) scores = [] for i in range(X.shape[1]): score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2", cv=ShuffleSplit(len(X), 3, .3)) scores.append((round(np.mean(score), 3), names[i])) print sorted(scores, reverse=True)

[(0.636, 'LSTAT'), (0.59, 'RM'), (0.472, 'NOX'), (0.369, 'INDUS'), (0.311, 'PTRATIO'), (0.24, 'TAX'), (0.24, 'CRIM'), (0.185, 'RAD'), (0.16, 'ZN'), (0.087, 'B'), (0.062, 'DIS'), (0.036, 'CHAS'), (0.027, 'AGE')]

## Summary

Univariate feature selection is in general best to get a better understanding of the data, its structure and characteristics. It can work for selecting top features for model improvement in some settings, but since it is unable to remove redundancy (for example selecting only the best feature among a subset of strongly correlated features), this task is better left for other methods. Which brings us to the next topic: Selecting good features Part II: Linear models and regularization.

“The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other increases)”.

There appears to be a mistake.

I feel like negative correlation is “one variable increases, the other decreases”.

Thank you!

Thanks for pointing that out, fixed now.

Thanks for your comprehensive comparison between feature selection techniques. Recently I work with Fisher score for selecting features, and I am not sure about whether to scale the features prior to the selection. It would be greatly appreciated if you help me on this: http://stats.stackexchange.com/questions/153532/normalization-before-feature-selection-such-as-fisher-score-f-score

Furthermore, my goal is to find few representative features for interpretation. According to your article, stability selection would be a good choice of protect the model from overfitting and feature interpretation. However, as I desire to have few features (e.g., one or two) out of up to 500 for interpretation, does it make sense to employ Lasso or random forest, which might remove features with high relevant to target class while highly correlated with others (which is bad for interpretation as described in the article)? I also consider to combine those advanced techniques with Fisher score as preprocessing step, e.g., Fisher score + random forest/Lasso + SVM, the purpose of the second step is to further reduce dimensionality for feature interpretation.

Pingback: 7 tools in every data scientist’s toolbox | Diving into data

This is very helpful. Thanks!

Thank you so much. Your posting very helpful

Thanks for this great series; How can we apply the methods discussed above to a classification problem?

Thank you So Much.It was an helpful article.

Keep Up

Pingback: Diving into data | azunaiml

Thank you so so much! I’m reading all your series of posts and they are amazing! the world need more people like you!

I’m writing my thesis and I want to cite you. Do you have any paper, report, or thesis?

Pingback: Classification and Regression – Min Liang's blog

Pingback: Feature Engineering – Min Liang's blog