
Algorithms From Scratch: Linear Regression

source link: https://towardsdatascience.com/algorithms-from-scratch-linear-regression-c654353d1e7c?gi=1a72f053d5a3

Detailing and Building a Linear Regression model from scratch




Photo by Isaac Smith on Unsplash

Linear Regression is a popular linear Machine Learning algorithm for regression-based problems. It is usually one of the first algorithms people learn when starting out in Machine Learning, due to its simplicity and the way it builds into other algorithms such as Logistic Regression and Neural Networks. In this story we are going to implement it from scratch so that we can build our intuition about what is going on within a Linear Regression model.

Link to Github Repo…

Note: There are many frameworks with highly optimized code such as Scikit-Learn, TensorFlow and PyTorch, so it is usually unnecessary to build your own algorithm from scratch. However, building it from scratch serves our intuition of what is happening within the model, and that helps when we attempt to improve our model's performance.

A linear model is an algorithm that makes a prediction by simply computing a weighted sum of the input features plus a bias term (also referred to as the intercept term). Putting that into perspective, when we use a linear regression model we hope to explain the relationship between a dependent variable (e.g. house price) and one or more independent variables (e.g. location, bedrooms, area, etc.).

Figure 1: Multivariate Linear Regression
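In case the figure does not render here, a written-out form of the multivariate linear regression hypothesis (using the W and b notation that appears in the code later on) is:

\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = W^\top x + b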

When we train a model we are attempting to set the parameters to get a line that best fits the training data. Therefore, when we train a Linear Regression model we are trying to find the value of theta that minimizes the cost function. The most common cost function for regression models is the RMSE; however, it is much easier to minimize the MSE, as doing so leads to the same result.
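For m training instances, the MSE we minimize (and the RMSE we report) can be written as:

\mathrm{MSE}(W, b) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}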

Creating the Model

If you have never written a Machine Learning algorithm from scratch, I greatly encourage you to do so. John Sullivan wrote a very useful story titled 6 Steps To Write Any Machine Learning Algorithm From Scratch: Perceptron Case Study, which is the best advice I have managed to find on the internet about writing algorithms from scratch.

Chunking the Algorithm

  1. Randomly initialize parameters for the hypothesis function
  2. Calculate the partial derivatives (Read more about this here ) — the corresponding update equations are sketched just after this list
  3. Update the parameters
  4. Repeat steps 2–3 for n iterations (or until the cost function is minimized)
  5. Inference
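For concreteness, steps 2 and 3 correspond to the following gradient descent updates of the MSE cost (a sketch in the W/b notation used in the code below, with learning rate α):

\frac{\partial \mathrm{MSE}}{\partial W} = \frac{2}{m} X^\top (\hat{y} - y), \qquad \frac{\partial \mathrm{MSE}}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)

W \leftarrow W - \alpha \frac{\partial \mathrm{MSE}}{\partial W}, \qquad b \leftarrow b - \alpha \frac{\partial \mathrm{MSE}}{\partial b}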

Implementation

For this section, I will be leveraging three Python packages: NumPy for linear algebra, Scikit-Learn (a popular Machine Learning framework), and Matplotlib to visualize our data.

import numpy as np 
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

First, we need a dataset. For this I will use sklearn.datasets.make_regression, which allows you to generate a random regression problem — see the Documentation. Next, I will split my data into training and test sets with sklearn.model_selection.train_test_split — see the Documentation.

# creating the data set
X, y = make_regression(n_samples=100, n_features=1, n_targets=1, noise=20, random_state=24)

# splitting training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=24)

Let's use matplotlib.pyplot to see how our data looks — see the Documentation.

# visualize 
plt.scatter(X, y)
plt.show()


Figure 2: Our generated regression problem

Now we can begin our implementation of Linear Regression. The first step of our chunked algorithm is to initialize the parameters for our hypothesis function (in the code below we simply set them to zero).

def param_init(X):
    """
    Initialize parameters for the linear regression model
    __________________
    Input(s)
    X: Training data
    __________________
    Output(s)
    params: Dictionary containing coefficients
    """
    params = {}  # initialize dictionary
    _, n_features = X.shape  # shape of training data

    # initializing coefficients to 0
    params["W"] = np.zeros(n_features)
    params["b"] = 0
    return params

Excellent. Next, we want to calculate the partial derivatives and update our parameters. We do this using a very important Machine Learning algorithm called Gradient Descent. Ergo, we can implement steps 2–4 with Gradient Descent.

def gradient_descent(X, y, params, alpha, n_iter):
    """
    Gradient descent to minimize the cost function
    __________________
    Input(s)
    X: Training data
    y: Labels
    params: Dictionary containing initial coefficients
    alpha: Model learning rate
    n_iter: Number of iterations
    __________________
    Output(s)
    params: Dictionary containing optimized coefficients
    """
    W = params["W"]
    b = params["b"]
    m = X.shape[0]  # number of training instances

    for _ in range(n_iter):
        # prediction with current weights
        y_pred = np.dot(X, W) + b
        # taking the partial derivatives of the coefficients
        dW = (2/m) * np.dot(X.T, (y_pred - y))
        db = (2/m) * np.sum(y_pred - y)
        # updates to coefficients
        W -= alpha * dW
        b -= alpha * db

    params["W"] = W
    params["b"] = b
    return params

Using these functions we can train our Linear Regression model on our training data to obtain model parameters that we will require for inference.

def train(X, y, alpha=0.01, n_iter=1000):
    """
    Train the Linear Regression model with Gradient Descent
    __________________
    Input(s)
    X: Training data
    y: Labels
    alpha: Model learning rate
    n_iter: Number of iterations
    __________________
    Output(s)
    params: Dictionary containing optimized coefficients
    """
    init_params = param_init(X)
    params = gradient_descent(X, y, init_params, alpha, n_iter)
    return params

Now, when we run this function we obtain the optimized weights from our training data, which we will use for inference on our test data. Next, we need to create a predict function for our inference that uses the stored weights.

def predict(X_test, params):
    """
    Make predictions with the trained Linear Regression parameters
    __________________
    Input(s)
    X_test: Unseen data
    params: Dictionary containing optimized weights from training
    __________________
    Output(s)
    y_preds: Predictions of the model
    """
    y_preds = np.dot(X_test, params["W"]) + params["b"]
    return y_preds

Great! Let’s run these functions and plot them to see what happens…

params = train(X_train, y_train) # train model
y_preds = predict(X_test, params) # inference
plt.scatter(X_test, y_test)
plt.plot(X_test, y_preds, color="red")
plt.title("Predictions Dummy Regression Data")
plt.xlabel("X axis")
plt.ylabel("Y axis")

plt.show()


Figure 3: Predictions from custom linear regression model.

Our line of best fit looks pretty decent. To be entirely sure of our implementation, we are fortunate enough to have many Machine Learning libraries, with optimized code, to compare against. To compare, I am simply going to check the RMSE of our implementation against Scikit-Learn's.

Note: It may also be worthwhile to plot their line of best fit on the same plot; if it covers yours, then you are on the right track.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
sklearn_y_preds = lin_reg.predict(X_test)
print(f"My implementation: {np.sqrt(mean_squared_error(y_test, y_preds))}\nSklearn implementation: {np.sqrt(mean_squared_error(y_test, sklearn_y_preds))}")

>>>> My implementation: 20.986105292320207
Sklearn implementation: 20.986105292320207

It’s a match!

For this implementation we used procedural programming, which can become quite complex and cumbersome to maintain, and which does not maximize the potential of Python as an Object-Oriented Programming (OOP) language. On that note, here is an OOP implementation of our Linear Regression model.

class LinReg():
    """
    Custom-made Linear Regression class
    """
    def __init__(self, alpha=0.01, n_iter=1000):
        self.alpha = alpha
        self.n_iter = n_iter
        self.params = {}

    def param_init(self, X_train):
        """
        Initialize parameters for the linear regression model
        __________________
        Input(s)
        X_train: Training data
        """
        _, n_features = X_train.shape  # shape of training data

        # initializing coefficients to 0
        self.params["W"] = np.zeros(n_features)
        self.params["b"] = 0
        return self


    def gradient_descent(self, X_train, y_train):
        """
        Gradient descent to minimize the cost function
        __________________
        Input(s)
        X_train: Training data
        y_train: Labels
        __________________
        Output(s)
        self: Object with optimized coefficients stored in self.params
        """
        W = self.params["W"]
        b = self.params["b"]
        m = X_train.shape[0]  # number of training instances

        for _ in range(self.n_iter):
            # prediction with current weights
            y_pred = np.dot(X_train, W) + b
            # taking the partial derivatives of the coefficients
            dW = (2/m) * np.dot(X_train.T, (y_pred - y_train))
            db = (2/m) * np.sum(y_pred - y_train)
            # updates to coefficients
            W -= self.alpha * dW
            b -= self.alpha * db

        self.params["W"] = W
        self.params["b"] = b
        return self

    def train(self, X_train, y_train):
        """
        Train the Linear Regression model with Gradient Descent
        __________________
        Input(s)
        X_train: Training data
        y_train: Labels
        __________________
        Output(s)
        self: Object with optimized coefficients stored in self.params
        """
        self.param_init(X_train)
        self.gradient_descent(X_train, y_train)
        return self

    def predict(self, X_test):
        """
        Make predictions with the trained Linear Regression parameters
        __________________
        Input(s)
        X_test: Unseen data
        __________________
        Output(s)
        y_preds: Predictions of the model
        """
        y_preds = np.dot(X_test, self.params["W"]) + self.params["b"]
        return y_preds

Let's instantiate this class and use it to make predictions on our test data…

linreg = LinReg()
linreg.train(X_train, y_train)
linreg.predict(X_test)

>>>>
array([   4.73888182,  -90.06369632,   80.39799712,   66.76983607,
        -49.97207144,   93.77905208,   34.30778991,  -38.2209702 ,
         78.03331698,   53.81416352,  102.96993005,  151.71946744,
         95.52801857,  104.82707085,   98.0492089 ,   45.05150211,
         -7.29917923,  -78.41675446,  -27.14118529,  -98.52923336,
        170.75840972, -106.22126739,   24.86194847,  -21.39127805,
         50.24074837])

As a sanity check, we will test whether the predictions are the same as those of our procedural implementation (which we already know matches the scikit-learn implementation).

linreg.predict(X_test) == y_preds

>>>>
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])

And there you have it!

Note: The problem we used as an example is a simple linear regression problem (univariate linear regression), y = Wx + b, where W is the weight and b is the bias.

Assumptions

Before deciding that we want to use Linear Regression, it is important to understand the assumptions the model makes about our data, so that we can perform the necessary feature engineering to align with them.

For more on feature engineering, see my previous story on this topic…

Linear Relationship — Linear regression only captures linear relationships, and therefore assumes that there is a linear relationship between the features and the target we want to predict. We can check the validity of this by plotting scatter graphs of the features against the target variable.
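As a rough sketch of this check (assuming the features live in a 2D NumPy array X and the target in a vector y, as in the generated dataset above), we could plot each feature against the target:

# scatter each feature against the target to eyeball linearity
for i in range(X.shape[1]):
    plt.scatter(X[:, i], y)
    plt.xlabel(f"feature {i}")
    plt.ylabel("target")
    plt.show()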

Multicollinearity — Refers to a situation in which two or more explanatory variables in a multiple regression model are highly related (Source: Wikipedia). When we implement Linear Regression we are assuming that there is little to no multicollinearity since when present it weakens the statistical power of the regression model. To identify whether multicollinearity is present in our data we can use a correlation matrix to see which features are highly correlated and remove one of the highly correlated features.
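A minimal sketch of this check with NumPy (assuming X is a feature matrix with several columns; the 0.9 threshold is an arbitrary choice for illustration):

# correlation matrix between features (rowvar=False treats columns as variables)
corr = np.corrcoef(X, rowvar=False)
print(corr)

# flag pairs of distinct features whose absolute correlation exceeds the threshold
high = np.argwhere(np.abs(corr) > 0.9)
print([(i, j) for i, j in high if i < j])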

Homoscedasticity — In statistics, a sequence (or a vector) of random variables is homoscedastic if all its random variables have the same finite variance (Source: Wikipedia). Linear Regression assumes that the noise (also referred to as the error term or random disturbance) is the same across all observations and does not depend on the values of the independent variables. We may use a scatter plot of the residual values against the predicted values to check for homoscedasticity. We should expect no patterns in the distribution, because if there is a pattern then the data is heteroscedastic.
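Using the variables from the example above (y_test and y_preds), a minimal sketch of this residuals-versus-predictions check could be:

# residuals should scatter evenly around zero, with no funnel or curve pattern
residuals = y_test - y_preds
plt.scatter(y_preds, residuals)
plt.axhline(0, color="red")
plt.xlabel("predicted values")
plt.ylabel("residuals")
plt.show()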

Multivariate Normality — We assume the variables we use for Linear Regression follow the Normal Distribution. We can check this using a histogram or a QQ-plot, and use a non-linear transformation (e.g. a log-transformation) to change the distribution of the data to fit our assumption.
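A quick sketch of the histogram check, together with an illustrative log-transformation (a log only makes sense for strictly positive values, so the shift below is purely for demonstration on our generated data):

# histogram of the target to eyeball its distribution
plt.hist(y, bins=30)
plt.show()

# illustrative log-transform; our generated y can be negative, so shift it first
y_shifted = y - y.min() + 1
plt.hist(np.log(y_shifted), bins=30)
plt.show()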

Autocorrelation — The correlation of a signal with a delayed copy of itself as a function of delay; informally, it is the similarity between observations as a function of the time lag between them (Source: Wikipedia). To put it simply, autocorrelation occurs when the residuals are not independent from each other. We may use a scatter plot to visually check for autocorrelation, or the Durbin-Watson test, which tests the null hypothesis that the residuals are not linearly autocorrelated.
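A short sketch of the Durbin-Watson check, assuming the statsmodels package is installed (it is not used elsewhere in this story); values near 2 suggest little autocorrelation, while values toward 0 or 4 suggest positive or negative autocorrelation:

from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic of the residuals from our model
residuals = y_test - y_preds
print(durbin_watson(residuals))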

Pros

  • Easy to implement
  • Easy to interpret the output coefficients
  • Less complex

Cons

  • Sensitive to Outliers
  • Assumes linear relationship between independent and dependent variables
  • Assumes independence between features

For more about Linear Regression you can read the Wikipedia page.

Wrap Up

Linear Regression is one of the simplest Machine Learning models and most likely the first you will (and should) learn. It is a very useful tool for analyzing the relationship between variables, but due to the many assumptions the model makes and its simplification of real-world problems, it is not recommended for most practical applications.

Thank you for taking the time to go through this story. If there is anything that I have missed or you’d like me to clarify, leave a response in the comments. Additionally, If you’d like to get in contact with me, I am most accessible on LinkedIn.

