
Lasso, Ridge, Elastic Net Regression

Regression models that regularize coefficients

Photo by NeONBRAND on Unsplash

The linear regression model is the first and simplest form of regression most data scientists learn for predicting continuous outcomes. Linear regression makes the following assumptions:

a. Predictors and the target variable have a linear relationship

b. The residuals (errors) follow a normal distribution

c. Predictors are not highly correlated with each other (no multicollinearity)

The linear regression model works by deriving coefficient values that minimize a loss function such as Root Mean Square Error (RMSE). However, when coefficient values become very large, the model tends to overfit: it makes very accurate predictions on the training dataset but not-so-accurate predictions on the test and real-world data. Regularization addresses this predicament by penalizing large coefficients, which also reduces the impact of multicollinearity.

Lasso, Ridge, and Elastic Net are advanced regression techniques that use regularization to produce better predictions where simple linear regression does not work well.

In this blog, I will take you through these advanced regression techniques.

Data Source

We are going to use the house price dataset available on Kaggle. Download the source files from the Kaggle competition page.

This dataset contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The challenge is to predict the final price of each home. The data is available in .csv format; store these files on your local disk.

Data Exploration and Preprocessing

Let’s review the data and understand what we have. The code below loads the train and test datasets and prints the header rows of the train dataset.
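Here is a minimal sketch of this step, assuming the Kaggle files are saved locally as train.csv and test.csv:

import pandas as pd

# Assumed local paths to the Kaggle files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Preview the first rows of the training data
print(train.head())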

Output:

[Header rows of the train dataset]

The following are key findings in the data:

a. It contains 80 feature variables, so there is a high chance of multicollinearity.

b. The Id feature is not useful.

c. SalePrice is the label (dependent variable).

As this blog is about exploring advanced regression techniques, I am not going to spend much time explaining the data pre-processing steps. The code below takes care of the required pre-processing (barring multicollinearity).
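Here is a minimal sketch of typical pre-processing for this dataset; the specific steps shown (log-transforming the target and skewed numeric features, one-hot encoding categorical columns, and mean-imputing missing values) and the skewness threshold are assumptions:

import numpy as np
from scipy.stats import skew

# Combine train and test features so both get identical encoding
all_data = pd.concat((train.drop(["Id", "SalePrice"], axis=1),
                      test.drop(["Id"], axis=1)))

# Log-transform the target to reduce skew
y = np.log1p(train["SalePrice"])

# Log-transform highly skewed numeric features (the 0.75 threshold is an assumption)
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75].index
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

# One-hot encode categoricals and fill remaining missing values with column means
all_data = pd.get_dummies(all_data)
all_data = all_data.fillna(all_data.mean())

# Split back into the train and test feature matrices
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]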

After data pre-processing, X_train and X_test contain 288 features.

X_train.shape

Output:

(1460, 288)

It’s not easy to handle 288 feature variables using simple linear regression approaches. We will use Multiple Linear Regression and compare its performance with advanced regression techniques like Lasso, Ridge, and Elastic Net, which can reduce the loss function, prioritize features, and get rid of multicollinearity. We will use the R2 score and Root Mean Square Error (RMSE) metrics to compare the performance of these regression techniques.

RMSE is a loss function. It measures the difference between the predicted value and the actual value.

RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )

RMSE Equation

The R2 score measures the proportion of variance in the target that is explained by the model, i.e. how close the data are to the fitted regression line.
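For reference, the R2 score can be written as:

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²

where ȳ is the mean of the actual values.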

Multiple Linear Regression

First, let’s use classical Multiple Linear Regression.
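A minimal sketch of this step, assuming scikit-learn's LinearRegression scored with 10-fold cross-validation; the variable name lr_model is an assumption (it is reused in the comparison code near the end of this post):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Plain multiple linear regression, scored with 10-fold cross-validation
lr_model = LinearRegression().fit(X_train, y)
print('R2 Score {:.3f}'.format(cross_val_score(lr_model, X_train, y, scoring="r2", cv=10).mean()))
print('RMSE Mean {:.3f}'.format(np.sqrt(-cross_val_score(lr_model, X_train, y, scoring="neg_mean_squared_error", cv=10)).mean()))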

Output:

R2 Score 0.722
RMSE Mean 0.171

LASSO Regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. Lasso encourages a simple model by shrinking the coefficient estimates towards a central point, such as zero. It achieves this shrinkage by regularizing the coefficient values.

Lasso regression uses L1 regularization: it adds a penalty equal to the absolute value of the magnitude of the coefficients. Let me explain.

Below is the equation for multiple linear regression:

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ

If we substitute this into the RMSE equation, it looks like this:

RMSE = √( (1/n) · Σᵢ (yᵢ − (β₀ + β₁x₁ᵢ + β₂x₂ᵢ + … + βₚxₚᵢ))² )

Regular linear regression techniques (i.e. simple and multiple linear regression) try to minimize RMSE to find the optimum regression line.

Lasso extends this loss function. It adds lambda (λ) multiplied by the sum of the absolute values of the coefficients. The loss function for Lasso is:

Loss = RMSE + λ · Σⱼ |βⱼ|

This is called L1 regularization. Minimizing this loss function means keeping the following term small:

λ · Σⱼ |βⱼ|

This term can be made small either by a low lambda value or by shrinking the coefficients, in other words by penalizing the coefficients.

Lasso reviews the features and penalizes those of low importance by shrinking their coefficients, in some cases all the way to zero. In other words, it also takes care of multicollinearity.
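A minimal sketch of this step, assuming scikit-learn's LassoCV with the same alpha grid used in the comparison code near the end of this post; the variable name model_lasso is reused below:

from sklearn.linear_model import LassoCV

# LassoCV selects the best alpha from the supplied grid via internal cross-validation
model_lasso = LassoCV(alphas=[1, 0.1, 0.001, 0.0005], cv=10).fit(X_train, y)
print('Lasso CV R2 Score {:.3f}'.format(cross_val_score(model_lasso, X_train, y, scoring="r2", cv=10).mean()))
print('Lasso CV RMSE Mean {:.3f}'.format(np.sqrt(-cross_val_score(model_lasso, X_train, y, scoring="neg_mean_squared_error", cv=10)).mean()))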

Output:

Lasso CV R2 Score 0.903
Lasso CV RMSE Mean 0.121

R2 score as well as RMSE has improved. Let’s understand how LASSO has achieved these improvements.

A. Feature selection: Did LASSO perform feature selection, reducing multicollinearity by penalizing correlated and less important features?

Let’s find out.
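A minimal sketch of this check, mirroring the Ridge version shown later in this post:

# Count how many coefficients Lasso shrank exactly to zero
coef = pd.Series(model_lasso.coef_, index=X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")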

Output:

Lasso picked 110 variables and eliminated the other 178 variables

Very good! LASSO got rid of 178 of the 288 variables and considered only 110. Even among these 110 variables, it has penalized a few with low coefficient values. Let's find out the top 10 and bottom 10 feature variables.

import matplotlib
import matplotlib.pyplot as plt

# Plot the 10 most negative and 10 most positive Lasso coefficients
imp_coef = pd.concat([coef.sort_values().head(10),
                      coef.sort_values().tail(10)])
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind="barh")
plt.title("Coefficients")
plt.show()
[Bar chart of the 10 largest and 10 smallest Lasso coefficients]

Here you go!

Now you know the top 10 and bottom 10 feature variables LASSO considered while reducing the loss function value.

How about the distribution of residuals? Let’s visualize them.

matplotlib.rcParams['figure.figsize'] = (6.0, 6.0)
preds = pd.DataFrame({"preds": model_lasso.predict(X_train), "true": y})
preds["residuals"] = preds["true"] - preds["preds"]
preds.plot(x="preds", y="residuals", kind="scatter")
plt.show()
[Scatter plot of residuals vs. predicted values]

The spread of residual values looks good.

Ridge Regression

Ridge regression performs L2 regularization. It adds lambda (λ) multiplied by the sum of the squared coefficients. Below is the loss function for ridge regression:

Loss = RMSE + λ · Σⱼ βⱼ²

Let’s implement the Ridge technique on this dataset.

# Ridge Regression
from sklearn.linear_model import Ridge

model_ridge = Ridge(alpha=10).fit(X_train, y)
print('Ridge R2 Score {:.3f}'.format(cross_val_score(model_ridge, X_train, y, scoring="r2", cv=10).mean()))
print('Ridge RMSE Mean {:.3f}'.format(np.sqrt(-cross_val_score(model_ridge, X_train, y, scoring="neg_mean_squared_error", cv=10)).mean()))

Output:

Ridge R2 Score 0.897
Ridge RMSE Mean 0.125

There is an improvement compared to Multiple Linear Regression, but not compared to LASSO. Let's check the features it considered.

coef = pd.Series(model_ridge.coef_, index = X_train.columns)
print("Ridge picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")

Output:

Ridge picked 288 variables and eliminated the other 0 variables

Unlike LASSO, ridge regression does not perform feature selection. It considered all 288 features. Maybe that's why it could not improve on LASSO in this case.

Even while considering all the feature variables, Ridge produced better results than Multiple Linear Regression because it regularized the loss function.

We used k-fold cross-validation with cv=10, which means the dataset was divided into 10 equal folds. Let's visualize the loss values calculated on these folds.

lr_loss = np.sqrt(-cross_val_score(lr_model, X_train, y, scoring="neg_mean_squared_error", cv=10))
lasso_loss = np.sqrt(-cross_val_score(LassoCV(alphas=[1, 0.1, 0.001, 0.0005], cv=10), X_train, y, scoring="neg_mean_squared_error", cv=10))
ridge_loss = np.sqrt(-cross_val_score(Ridge(alpha=10), X_train, y, scoring="neg_mean_squared_error", cv=10))

fig, ax = plt.subplots()
X_val = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ax.plot(X_val, lr_loss, color="blue", label='Linear Reg')
ax.plot(X_val, lasso_loss, color="red", label='LASSO')
ax.plot(X_val, ridge_loss, color="green", label='Ridge')
plt.xlabel('Fold')
plt.ylabel('RMSE')
ax.legend()
ax.grid(True)
plt.show()

Output:

[Line plot of per-fold RMSE for Linear Regression, LASSO, and Ridge]

It clearly shows that across all 10 folds the loss values for LASSO and Ridge stayed lower and more stable than for multiple linear regression.

This is the power of regularization!

Hope you found this blog useful! I look forward to your questions and comments.

References:

Machine Learning Hands-on Course

Simple Linear Regression: https://medium.com/analytics-vidhya/simple-linear-regression-and-fun-behind-it-df509c2a057

Multiple Linear Regression: https://medium.com/analytics-vidhya/multiple-linear-regression-7727a012ff93

Polynomial Linear Regression: https://medium.com/sanrusha-consultancy/polynomial-linear-regression-9d691a605aa0

KNN Regression: https://medium.com/sanrusha-consultancy/k-nearest-neighbor-knn-regression-and-fun-behind-it-7055cf50ae56

