
How I improved the performance of my ML model from 70% to 95%

Ensemble Learning: an efficient way to improve the performance of your ML model!

Photo by Stem List on Unsplash

In my previous blog, I explained Bias, Variance, and Irreducible errors.

Here’s the link to the blog -> Bias Variance Irreducible Error and Model Complexity Trade off

One of the techniques to reduce these errors (Bias and Variance) is Ensemble Learning. It combines several machine learning models to get optimized results with decreased variance (bagging), bias (boosting), and improved prediction (stacking).

In this blog, you are going to have hands-on practice on Ensemble Learning methods.

Data Source:

We are going to use the Pima Indians Diabetes dataset. Download the diabetes.csv file from the link below.

The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Challenge:

Predict outcome (diabetic or not) based on the patient’s BMI, insulin level, age, and other feature values.

Let’s try different supervised learning methods and calculate their accuracy.

Execute the lines of code below to read the data into a Pandas DataFrame, build the feature matrix and label array, and split the data into train and test sets.

#Import Libraries
import pandas as pd
import numpy as np

#Read data into pandas dataframe
df = pd.read_csv(r'<put your file path here>\diabetes.csv')

#Define Feature Matrix (X) and Label Array (y)
X = df.drop(['Outcome'], axis=1)
y = df['Outcome']

#Define train and test data set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now let’s train each classifier and compare their accuracy.

KNN Classifier:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN Accuracy ", knn.score(X_test, y_test))

KNN Accuracy is 78%

KNN Accuracy  0.7857142857142857

Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier

dec_cls = DecisionTreeClassifier()
dec_cls.fit(X_train, y_train)
y_pred_dec = dec_cls.predict(X_test)
print("Decision Tree Accuracy ", dec_cls.score(X_test, y_test))

Decision tree classifier accuracy is about 78%

Decision Tree Accuracy  0.7792207792207793

Logistic Regression:

from sklearn.linear_model import LogisticRegression
lrc=LogisticRegression()
lrc.fit(X_train,y_train)
y_pred_log=lrc.predict(X_test)
print("Logistic Regression Accuracy ",lrc.score(X_test,y_test))

Accuracy for Logistic Regression is 81%.

Logistic Regression Accuracy  0.8181818181818182

Support Vector Machine (SVM) Classifier:

from sklearn.svm import SVC
svc_classifier=SVC(kernel="linear",random_state=0)
svc_classifier.fit(X_train,y_train)
y_pred_svc=svc_classifier.predict(X_test)
print("SVC Accuracy ",svc_classifier.score(X_test,y_test))

SVC Accuracy is about 81%

SVC Accuracy  0.8181818181818182

Voting Classifier:

We trained different models (SVM, KNN, Logistic Regression, Decision Tree) on the same training data set and calculated their individual accuracy. How about combining these models and letting them vote on each prediction? This can be done using the VotingClassifier class from sklearn.

from sklearn.ensemble import VotingClassifier

vote_cls = VotingClassifier(estimators=[('svc', svc_classifier), ('lr', lrc), ('knn', knn), ('dt', dec_cls)], voting='hard')
vote_cls.fit(X_train, y_train)
y_pred_vote_cls = vote_cls.predict(X_test)
print('Voting Classifier Accuracy ', vote_cls.score(X_test, y_test))

Voting classifier accuracy is 81%

Voting Classifier Accuracy  0.8181818181818182

Make a note of the voting='hard' option in VotingClassifier.

There are two kinds of voting: hard and soft.

a) In hard voting, the majority determines the outcome: for each sample, the predicted class is the mode of the individual models' predicted labels. We had the following individual accuracy scores:

KNN Accuracy  0.7857142857142857
Decision Tree Accuracy 0.7922077922077922
SVC Accuracy 0.8181818181818182
Logistic Regression Accuracy 0.8181818181818182

Most of the models score around 81%, so it is no surprise that the hard voting classifier also landed at 81% accuracy.

However, note that the hard voting classifier takes the mode of the predicted labels for each sample, not the mode of the overall accuracy scores.

b) Soft voting is applicable to classifiers that can output class probabilities (e.g. Logistic Regression). The soft voting classifier calculates the (optionally weighted) average of the individual models' predicted probabilities and picks the class with the highest average probability.
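For completeness, here is a minimal sketch of soft voting on the same data (this step is not part of the original walkthrough); note that SVC needs probability=True before it can output class probabilities.

from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

#Soft voting: average the predicted class probabilities of the individual models
soft_vote_cls = VotingClassifier(
    estimators=[('svc', SVC(kernel="linear", probability=True, random_state=0)),
                ('lr', lrc), ('knn', knn), ('dt', dec_cls)],
    voting='soft')
soft_vote_cls.fit(X_train, y_train)
print('Soft Voting Classifier Accuracy ', soft_vote_cls.score(X_test, y_test))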

Bagging

So far we have used different models on the same training data set, got the individual prediction, and used voting classifiers to get the best outcome.

Instead of using different models on the same training data set, how about drawing several random samples from the training data, training a copy of the same model on each sample, and combining the results by voting for classification and averaging for regression? This is called Bagging (bootstrap aggregating).

Using bootstrap sampling (sampling with replacement), bagging creates several subsets of the original training data. Because sampling is done with replacement, each bootstrap sample of the same size as the training set contains, on average, about 63% unique training points.

Note: Only the rows of the training data are resampled. The features are not touched; every subset contains all the features.
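To see where that ~63% figure comes from, here is a quick illustrative check (not from the original article) that draws bootstrap index samples with NumPy and counts the unique rows:

import numpy as np

rng = np.random.default_rng(0)
n = len(X_train)

#Draw indices with replacement, as bagging does internally, and measure
#the fraction of distinct rows that appear in each bootstrap sample
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(1000)]
print("Average fraction of unique rows per bootstrap sample:", np.mean(fractions))
#Prints a value close to 1 - 1/e ≈ 0.632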

Figure 1 explains Bagging.

Fig 1: Bagging

Bagging mainly reduces variance, and a single decision tree is a high-variance model, which makes it a natural candidate. Let's apply bagging to the decision tree classifier.

We are going to train 25 decision trees (n_estimators=25), each on its own bootstrap sample of the training data.

from sklearn.ensemble import BaggingClassifier

#Bagging Decision Tree Classifier
#initialize base classifier
dec_tree_cls = DecisionTreeClassifier()
#number of base classifiers
no_of_trees = 25
#bagging classifier
bag_cls = BaggingClassifier(base_estimator=dec_tree_cls, n_estimators=no_of_trees, random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train, y_train)
bag_cls.predict(X_test)
print("Bagging Classifier Accuracy ", bag_cls.score(X_test, y_test))

Accuracy has increased to 82%.

Bagging Classifier Accuracy  0.8246753246753247

As evident by this example, bagging has improved accuracy.

Let’s try bagging with the KNN classifier.

#Bagging KNN Classifier
#initialize base classifier
knn_cls = KNeighborsClassifier(n_neighbors=12)
#number of base classifiers
no_of_trees = 25
#bagging classifier
bag_cls = BaggingClassifier(base_estimator=knn_cls, n_estimators=no_of_trees, random_state=10, bootstrap=True, oob_score=True)
bag_cls.fit(X_train, y_train)
bag_cls.predict(X_test)
print("Bagging Classifier Accuracy ", bag_cls.score(X_test, y_test))

Accuracy is 78%.

Bagging Classifier Accuracy  0.7857142857142857

In the case of KNN, the accuracy remains the same; bagging has not improved the prediction.

Bagging brings good improvements to classifiers like a plain decision tree, but it could not improve KNN. This is because KNN is a stable, low-variance model whose predictions depend only on the neighboring data points, so resampling the training data changes them very little.

Random Forest:

Random forest is an enhanced version of bagging. In bagging, the training data is resampled into several subsets without touching the features: each subset contains all the features.

Consider a typical decision tree classifier. If the training data set contains 11 features, the regular decision tree as well as the bagging classifier will consider all 11 features.

Regular Decision Tree Structure

In a random forest, instead of using all the features, a random subset of the features is selected in each subset of the training data.

A tree in the random forest will look like the figure below.

Random Forest

There is more than one tree (the trees are called estimators), and each tree uses only a selected subset of the features.

Random forest is a fast and very effective classifier. Let's use it on the same data set and see whether there is any improvement.

from sklearn.ensemble import RandomForestClassifier
rnd_clf=RandomForestClassifier(n_estimators=53, n_jobs=-1, random_state=8)
rnd_clf.fit(X_train,y_train)
rnd_clf.predict(X_test)
print("Random Forest Score ",rnd_clf.score(X_test,y_test))

Accuracy score is 83%

Random Forest Score  0.8311688311688312

So, there is an improvement. However, finding the right number of estimators is key. The general belief is "the more estimators the merrier", but that is not always true.
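One way to pick these hyperparameters is a small cross-validated grid search, sketched below with an illustrative (not tuned) parameter grid:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

#Illustrative grid: the candidate values here are arbitrary choices, not recommendations
param_grid = {'n_estimators': [25, 53, 100, 200],
              'max_features': ['sqrt', 'log2', None]}
grid = GridSearchCV(RandomForestClassifier(random_state=8, n_jobs=-1),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print("Best parameters ", grid.best_params_)
print("Best cross-validated accuracy ", grid.best_score_)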

Boosting

In the case of bagging, the training data subsets are fed to the models in parallel, and the outcome is decided by combining the individual models' predictions.

Boosting improves the performance of weak learners by reducing bias: each weak learner learns from the mistakes of the previous model trained on the data. Boosting therefore follows sequential learning.

The below diagram explains boosting.

Boosting

Adaboost

AdaBoost is a well-known ensemble boosting classifier. It works sequentially, as explained in the figure above. It starts by training a weak learner on the training data and then iteratively trains the next learner based on the prediction accuracy of the previous one. It reduces bias by assigning a higher weight to wrongly classified observations, so that these observations get more attention in the next iteration. The iterations continue until the specified maximum number of estimators is reached.

Let’s use Adaboost and confirm if it improves accuracy.

from sklearn.ensemble import AdaBoostClassifier
adb_cls=AdaBoostClassifier(n_estimators=153, learning_rate=1)
adb_cls.fit(X_train,y_train)
y_adb_pred=adb_cls.predict(X_test)
print("AdaBoost Classifier ",adb_cls.score(X_test,y_test))

Outcome

AdaBoost Classifier  0.8376623376623377

Not bad! It has improved the performance to 83%.

Gradient Boosting Model (GBM)

The Gradient Boosting Model is one of the most used and most efficient ensemble models.

Gradient Boosting can be expanded as Gradient Descent + Boosting.

Gradient Descent focuses on the optimization of the loss function. It can be explained well using linear regression.

Below is the equation for linear regression.

ŷ = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ

Below is the formula for the Mean Squared Error (MSE) loss function:

MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)²

Gradient descent focuses on finding the optimal values of the weights w such that the MSE is at its minimum.

It starts with a random value of w and calculates the impact of changing w on the MSE. It keeps updating w in small steps until it reaches the minimum MSE.


The size of each step is called the learning rate. The learning rate can be passed as a hyperparameter to the classifier. A high learning rate means moving quickly towards the optimal point, but it can also overshoot the lowest point and miss the optimal value of w. Keeping the learning rate low mitigates this risk, but it requires more computation, as many more update steps are involved.
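To make this concrete, here is a tiny illustrative sketch (not from the original article) of gradient descent fitting a single weight w to minimize the MSE of y ≈ w·x; the toy data and learning rate are made up for demonstration.

import numpy as np

#Toy data: y is roughly 3*x plus noise, so the optimal w should end up near 3
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = 3 * x + rng.normal(0, 0.1, 100)

w = 0.0            #start from an arbitrary weight
learning_rate = 0.5

for step in range(200):
    y_pred = w * x
    #Gradient of MSE = mean((y - w*x)^2) with respect to w
    grad = -2 * np.mean((y - y_pred) * x)
    w -= learning_rate * grad

print("Learned w:", w)   #should be close to 3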

Gradient boosting focuses on reducing the residual error. It follows the sequential learning mechanism of boosting: each new model is trained to reduce the loss left behind by the previous ones.

Run the lines of code below to see whether there is any improvement with gradient boosting.

Here, we pass different values of the learning rate and pick the optimal one based on the model score.

from sklearn.ensemble import GradientBoostingClassifier

lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.55, 1.65, 1.75]

for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=53, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)
    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))

Result:

Learning rate:  0.05
Accuracy score (training): 0.798
Learning rate: 0.075
Accuracy score (training): 0.805
Learning rate: 0.1
Accuracy score (training): 0.816
Learning rate: 0.25
Accuracy score (training): 0.853
Learning rate: 0.5
Accuracy score (training): 0.902
Learning rate: 0.75
Accuracy score (training): 0.925
Learning rate: 1
Accuracy score (training): 0.940
Learning rate: 1.25
Accuracy score (training): 0.953
Learning rate: 1.55
Accuracy score (training): 0.935
Learning rate: 1.65
Accuracy score (training): 0.938
Learning rate: 1.75
Accuracy score (training): 0.919

The training accuracy can be improved to 95.3% by using a learning rate of 1.25.

This is a big improvement over the roughly 78–81% scored by the individual classifiers earlier.
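As a follow-up (this step is not in the original walkthrough, just a sketch following the same pattern as the earlier classifiers), the chosen learning rate can also be scored on the held-out test set:

from sklearn.ensemble import GradientBoostingClassifier

#Refit with the learning rate selected above and score on the held-out test set,
#using the same train/test split as the earlier classifiers
gb_best = GradientBoostingClassifier(n_estimators=53, learning_rate=1.25,
                                     max_features=2, max_depth=2, random_state=0)
gb_best.fit(X_train, y_train)
print("Gradient Boosting test accuracy ", gb_best.score(X_test, y_test))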

Awesome!!

Happy Machine learning until the next blog!
