Selecting good features – Part II: linear models and regularization

In my previous post I discussed univariate feature selection where each feature is evaluated independently with respect to the response variable. Another popular approach is to utilize machine learning models for feature ranking. Many machine learning models have either some inherent internal ranking of features or it is easy to generate the ranking from the structure of the model. This applies to regression models, SVM’s, decision trees, random forests, etc.

In this post, I will discuss using coefficients of regression models for selecting and interpreting features. This is based on the idea that when all features are on the same scale, the most important features should have the highest coefficients in the model, while features uncorrelated with the output variables should have coefficient values close to zero. This approach can work well even with simple linear regression models when the data is not very noisy (or there is a lot of data compared to the number of features) and the features are (relatively) independent:

from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(0)
size = 5000

#A dataset with 3 features
X = np.random.normal(0, 1, (size, 3))
#Y = X0 + 2*X1 + noise
Y = X[:,0] + 2*X[:,1] + np.random.normal(0, 2, size)
lr = LinearRegression()
lr.fit(X, Y)

#A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names = None, sort = False):
    if names == None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst,  key = lambda x:-np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                                   for coef, name in lst)

print "Linear model:", pretty_print_linear(lr.coef_)

Linear model: 0.984 * X0 + 1.995 * X1 + -0.041 * X2

As you can see in this example, the model indeed recovers the underlying structure of the data very well, despite quite significant noise in the data. But actually, this learning problem was particularly well suited for a linear model: purely linear relationship between features and the response variable, and no correlations between features.

When there are multiple (linearly) correlated features (as is the case with very many real life datasets), the model becomes unstable, meaning that small changes in the data can cause large changes in the model (i.e. coefficient values), making model interpretation very difficult (so called multicollinearity problem). For example, assume we have a dataset where the “true” model for the data is \(Y = X_1 + X_2\), while we observe \(\hat{Y} = X_1 + X_2 + \epsilon\), with \(\epsilon\) being noise. Furthermore, assume that \(X_1\) and \(X_2\) are linearly correlated such that \(X_1 \approx X_2\). Ideally the learned model will be \(Y = X_1 + X_2\). But depending on the amount of noise \(\epsilon\), the amount of data at hand and the correlation between \(X_1\) and \(X_2\), it could also be \(Y = 2X_1\) (i.e. using only \(X_1\) as the predictor) or \(Y = -X_1 + 3X_2\) (shifting of the coefficients might happen to give a better fit in the noisy training set) etc.

Let’s look at the same correlated dataset as in the random forest example, with some noise added.

from sklearn.linear_model import LinearRegression

size = 100
np.random.seed(seed=5)

X_seed = np.random.normal(0, 1, size)
X1 = X_seed + np.random.normal(0, .1, size)
X2 = X_seed + np.random.normal(0, .1, size)
X3 = X_seed + np.random.normal(0, .1, size)
 
Y = X1 + X2 + X3 + np.random.normal(0,1, size)
X = np.array([X1, X2, X3]).T
 
lr = LinearRegression()
lr.fit(X,Y)
print "Linear model:", pretty_print_linear(lr.coef_)

Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2

Coefficients sum up to ~3, so we can expect the learned model to perform well. On the other hand, if we were to interpret the coefficients at face value, then according to the model \(X_3\) has a strong positive impact on the output variable, while \(X_1\) has a negative one. Actually all features are correlated almost equally to the output variable.

The same method and concerns apply to other similar linear methods, for example logistic regression.

Regularized models

Regularization is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalization. Instead of minimizing a loss function \(E(X, Y)\), the loss function to minimize becomes \(E(X, Y)+ \alpha \|w\|\), where \(w\) is the vector of model coefficients, \(\|\cdot\|\) is typically L1 or L2 norm and \(\alpha\) is a tunable free parameter, specifying the amount of regularization (so \(\alpha = 0\) implies an unregularized model). For regression models, the two widely used regularization methods are L1 and L2 regularization, also called lasso and ridge regression when applied in linear regression.

L1 regularization / Lasso

L1 regularization adds a penalty \(\alpha \sum_{i=1}^n \left|w_i\right|\) to the loss function (L1-norm). Since each non-zero coefficient adds to the penalty, it forces weak features to have zero as coefficients. Thus L1 regularization produces sparse solutions, inherently performing feature selection.
For regression, Scikit-learn offers Lasso for linear regression and Logistic regression with L1 penalty for classification.

Lets run Lasso on the Boston housing dataset with a good \(\alpha\) (which can be found for example via grid search):

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
 
boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]
 
lasso = Lasso(alpha=.3)
lasso.fit(X, Y)
 
print "Lasso model: ", pretty_print_linear(lasso.coef_, names, sort = True)

Lasso model: -3.707 * LSTAT + 2.992 * RM + -1.757 * PTRATIO + -1.081 * DIS + -0.7 * NOX + 0.631 * B + 0.54 * CHAS + -0.236 * CRIM + 0.081 * ZN + -0.0 * INDUS + -0.0 * AGE + 0.0 * RAD + -0.0 * TAX

We see that a number of features have coefficient 0. If we increase \(\alpha\) further, the solution would be sparser and sparser, i.e. more and more features would have 0 as coefficients.

Note however that L1 regularized regression is unstable in a similar way as unregularized linear models are, meaning that the coefficients (and thus feature ranks) can vary significantly even on small data changes when there are correlated features in the data. Which brings us to L2 regularization.

L2 regularization / Ridge regression

L2 regularization (called ridge regression for linear regression) adds the L2 norm penalty (\(\alpha \sum_{i=1}^n w_i^2\)) to the loss function.
Since the coefficients are squared in the penalty expression, it has a different effect from L1-norm, namely it forces the coefficient values to be spread out more equally. For correlated features, it means that they tend to get similar coefficients. Coming back to the example of \(Y = X_1 + X_2\), with strongly correlated \(X_1\) and \(X_2\), then for L1, the penalty is the same whether the learned model is \(Y = 1*X_1 + 1*X_2\) or \(Y = 2*X_1 + 0*X_2\). In both cases the penalty is \(2*\alpha\). For L2 however, the first model’s penalty is \(1^2 + 1^2 = 2\alpha\), while for the second model is penalized with \(2^2 + 0^2 = 4 \alpha\).

The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularized or L1 models). So while L2 regularization does not perform feature selection the same way as L1 does, it is more useful for feature *interpretation*: a predictive feature will get a non-zero coefficient, which is often not the case with L1.

Lets look at the example with three correlated features again, running the example 10 times with different random seeds, to emphasize the stability of L2 regression.

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
size = 100

#We run the method 10 times with different random seeds
for i in range(10):
    print "Random seed %s" % i
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T


    lr = LinearRegression()
    lr.fit(X,Y)
    print "Linear model:", pretty_print_linear(lr.coef_)

    ridge = Ridge(alpha=10)
    ridge.fit(X,Y)
    print "Ridge model:", pretty_print_linear(ridge.coef_)
    print

Random seed 0 Linear model: 0.728 * X0 + 2.309 * X1 + -0.082 * X2 Ridge model: 0.938 * X0 + 1.059 * X1 + 0.877 * X2
Random seed 1 Linear model: 1.152 * X0 + 2.366 * X1 + -0.599 * X2 Ridge model: 0.984 * X0 + 1.068 * X1 + 0.759 * X2
Random seed 2 Linear model: 0.697 * X0 + 0.322 * X1 + 2.086 * X2 Ridge model: 0.972 * X0 + 0.943 * X1 + 1.085 * X2
Random seed 3 Linear model: 0.287 * X0 + 1.254 * X1 + 1.491 * X2 Ridge model: 0.919 * X0 + 1.005 * X1 + 1.033 * X2
Random seed 4 Linear model: 0.187 * X0 + 0.772 * X1 + 2.189 * X2 Ridge model: 0.964 * X0 + 0.982 * X1 + 1.098 * X2
Random seed 5 Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2 Ridge model: 0.758 * X0 + 1.011 * X1 + 1.139 * X2
Random seed 6 Linear model: 1.199 * X0 + -0.031 * X1 + 1.915 * X2 Ridge model: 1.016 * X0 + 0.89 * X1 + 1.091 * X2
Random seed 7 Linear model: 1.474 * X0 + 1.762 * X1 + -0.151 * X2 Ridge model: 1.018 * X0 + 1.039 * X1 + 0.901 * X2
Random seed 8 Linear model: 0.084 * X0 + 1.88 * X1 + 1.107 * X2 Ridge model: 0.907 * X0 + 1.071 * X1 + 1.008 * X2
Random seed 9 Linear model: 0.714 * X0 + 0.776 * X1 + 1.364 * X2 Ridge model: 0.896 * X0 + 0.903 * X1 + 0.98 * X2

As you can see from the example, the coefficients can vary widely for linear regression, depending on the generated data. For L2 regularized model however, the coefficients are quite stable and closely reflect how the data was generated (all coefficients close to 1).

Summary

Regularized linear models are a powerful set of tool for feature interpretation and selection. Lasso produces sparse solutions and as such is very useful selecting a strong subset of features for improving model performance. Ridge regression on the other hand can be used for data interpretation due to its stability and the fact that useful features tend to have non-zero coefficients. Since the relationship between the response variable and features in often non-linear, basis expansion can be used to convert features into a more suitable space, while keeping the simple linear models fully applicable.

Next up: Selecting good features Part III: tree based methods.

27 comments on “Selecting good features – Part II: linear models and regularization”

Ankit Agarwal on July 8, 2015 at 5:03 pm said:

In ridge regression how you had selected alpha=10 in
ridge = Ridge(alpha=10)

Reply ↓
- ando on July 9, 2015 at 8:41 am said:
  
  This is called hyperparameter optimization and can be done for example using grid search and cross-validation.
  
  Reply ↓
Nathan on May 7, 2016 at 10:44 am said:

Thank your for writing this!
About L2 regularization when you say “For correlated features, it means that they tend to get similar coefficients”, does that mean that we could/should eliminate features based on the fact they have similar coefs and thus carry the same information?

Reply ↓
- ando on May 7, 2016 at 9:30 pm said:
  
  No, you shouldnt eliminate features simply based on the fact that they have similar coefficients (two uncorrelated features could also happen to have similar coefficients). If you want to do feature elimination, use L1 and see which coefficients are driven to 0.
  
  Reply ↓
  - dkwca on August 10, 2016 at 6:06 am said:
    
    It is not strictly true that correlated features have similar coefficients either.
    In many cases, the model overfits and gives large positive weights to one of the correlated and large negative weights to one of the negative features.
    Of course this doesn’t happen if you regularize, which was exactly your point.
    
    Reply ↓
Divyansh Agarwal on June 6, 2016 at 5:47 am said:

How does one interpret the coefficients produced by ridge regression? For example, if only X1 and X2 are highly correlated, and both have coefficients 1.01 and 1.08 respectively produced by ridge regression, would it be right to say that a unit change in X1 results in a 1.01 unit change in the Y if all other variables (including X2) are held constant??

Reply ↓
- Anonymous on November 16, 2016 at 6:53 am said:
  
  If two variable are highly correlated then this interpretation is not appropriate
  
  Reply ↓
Adam on June 8, 2016 at 6:19 pm said:

Hi ..
I wanted to ask how can i identify the merged filtered features ? for example:

linear model : y= aX0 + bX1 +cX3 + d(X0+X1) + e(X0+X3) + ….

Thanks

Reply ↓
Pingback: Classification and Regression – Min Liang's blog
Pingback: Feature Engineering – Min Liang's blog
Paul on April 3, 2017 at 2:39 pm said:

Very nice article, exactly what I would need. Just a pitty that it’s not available for R.

Reply ↓
- Brian on April 21, 2017 at 8:17 pm said:
  
  This type of analysis is readily available in R, in fact the people who developed the elastic net (alpha between 0 and 1) wrote the package in R (it is called glmnet). Glmnet can perform ridge, lasso, or elastic net regression, I suggest you check it out.
  
  Reply ↓
Deniel on July 13, 2017 at 3:54 pm said:

Hi,
thnaks for your great post!
Just a quick question.
You use coefficients of a linear regression to measure how important an independent variable is to a dependent variable.
However, coefficient = r (sy/sx1), which means if the variance of x1 is big , then the coefficient is small.
But I would still say that x1 is a good feature suffice it to say that r is high.
What do you think

Reply ↓
Pingback: 结合Scikit-learn介绍几种常用的特征选择方法 – Androidev
Pingback: Bayesian Regression using PyMC3 – Site Title
Pingback: 7 tools in every data scientist’s toolbox | Premium Blog!
Srinath on March 4, 2018 at 6:54 pm said:

Can you explain the word ‘scale’ here? What do you mean by the same scale?

“This is based on the idea that when all features are on the same scale, the most important features should have the highest coefficients in the model, while features uncorrelated with the output variables should have coefficient values close to zero.”

Reply ↓
- ando on March 6, 2018 at 2:03 pm said:
  
  Check https://en.wikipedia.org/wiki/Feature_scaling
  
  Reply ↓
Pingback: Data scientist’s toolbox - Pro Analytics Expert
Sean S. on April 4, 2018 at 8:48 pm said:

Hello, this post was quite helpful.

However, I am curious as to how you map the coefficients back to the actual names of the variables you had in your dataframe. Is it safe to assume that the order of the ‘X0, X1, X2…’ variables in the final equation the function delivers is the same order as the variables were in in the original dataframe?

It seems to difficult to interpret the effect each variable is having on the prediction if you can’t map the coefficient back to the actual variable.

Thanks in advance!

Reply ↓
- ando on April 5, 2018 at 9:16 pm said:
  
  Yes, the order will be the same, i.e. first coefficient maps to first column in the data, second coefficient to the second column etc.
  
  Reply ↓
Ebrahimi on July 3, 2018 at 9:44 am said:

Hi
I would greatly appreciate if you could let me know whether I should omit highly correlated features before using Lasso (L1) to do feature selection. In fact, I want to do feature selection via Lasso. However, some of my features are highly correlated e.g., -1 or 0.9. Should I omit them before applying Lasso or let the Lasso decides this?
Really, I read in Mr. Raschka’s book (Python Machine Learning) that “regularization is a very useful method to handle collinearity (high correlation among features)”.
Thanks in advance.
Best regards,

Reply ↓
- ando on July 8, 2018 at 8:56 am said:
  
  You can use Lasso for feature selection. It will choose some of the correlated features
  
  Reply ↓
Pingback: Идентификация на основе множества признаков — Моделирование и распознавание 2D/3D образов
Laleh Asadzadeh on August 8, 2019 at 3:10 am said:

Thank you for writing this. It truly helped me.

Reply ↓
Atakan on April 29, 2020 at 5:11 am said:

Very helpful post, thanks so much. Do you have an insight on what constitutes “too much correlation” between predictors which will start to make lasso model unreliable? Is there a cut off like |rho|> 0.75 or something like that?

Reply ↓
Pingback: What is the co-linearity cutoff between variables that starts to make LASSO predictions unreliable? – GrindSkills