
Machine Learning Cross (K-fold) Validation Introduction | by Sanjay Singh

source link: https://medium.com/sanrusha-consultancy/machine-learning-cross-k-fold-validation-introduction-de5a385de8ec

Machine Learning Cross (K-fold) Validation Introduction


Machine learning involves splitting the available observations into training and test sets, training the model on the training set, and evaluating it on the test set.

The figure below shows the usual steps involved in machine learning.

Fig1: Machine Learning observation split
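
The split described above can be sketched with scikit-learn's train_test_split. This is a minimal illustration on a synthetic dataset (make_classification stands in for real observations; all names here are illustrative):

```python
# A minimal sketch of the train/test workflow in Fig1, using a synthetic
# dataset in place of real observations.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 20% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)            # train on the training set
print(model.score(X_test, y_test))     # evaluate on the held-out test set
```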

This approach works well if you have a lot of observations available to train and test your model, but it shows the following weaknesses when the observation size is small.

  1. The available observations have to be split, leaving less data to train the algorithm and derive an optimal model.
  2. Accuracy calculated on the training data is misleadingly optimistic, especially for algorithms like decision trees, which can fit the training set almost perfectly.
  3. The test set is small (usually about 20% of the observations), so the accuracy calculated on it might not match the accuracy on non-training data (e.g. live data), and over time the model might need re-training.
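
Point 2 can be seen directly: an unconstrained decision tree keeps splitting until it fits its training data (near) perfectly, so training accuracy tells you almost nothing. A quick sketch on synthetic data (an assumption, not the article's dataset):

```python
# Why training accuracy is uninformative for decision trees: with no depth
# limit, the tree typically fits the training set perfectly.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

tree = DecisionTreeClassifier(random_state=1)   # no depth limit
tree.fit(X, y)
print(tree.score(X, y))   # training accuracy, typically 1.0
```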

Cross validation (commonly performed as K-fold validation) addresses these challenges to an extent.

The goal of cross validation is to validate the model several times without sacrificing data available to train it. The training data is split into K sample sets (folds); in each cycle the model is trained on K−1 folds and validated independently on the remaining one. The accuracies from the K validation cycles are then averaged to estimate the model's accuracy in a live system.

Machine Learning Cross Validation

Because cross validation estimates the expected accuracy of the optimized model on non-training data, it provides a better estimate of how the model will perform in a live application.

The benefit is that it obtains this estimate from the training data alone; unlike the previous approach, you don't need to set aside a separate subset of data to calculate it.
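
The fold-by-fold procedure described above can be written out by hand with scikit-learn's KFold, which is essentially what cross_val_score automates. A sketch on synthetic data (the dataset and variable names are illustrative):

```python
# What K-fold validation does step by step: split the data into K folds,
# train on K-1 folds, validate on the held-out fold, and average the scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=250, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                # train on 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the 5th

print(np.mean(scores))   # estimate of accuracy on unseen data
```

Note that for classifiers, cross_val_score with an integer cv actually uses stratified folds (StratifiedKFold), which preserves class proportions in each fold; the plain KFold loop above is the simplest version of the idea.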

Implementation

Let’s implement K-fold validation using the scikit-learn function cross_val_score.

You can get the source data (the Pima Indians Diabetes dataset) from Kaggle.

First things first, let’s import all the libraries:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

Import the data into Pandas Dataframe

df = pd.read_csv(r'C:\Sanjay\Machine Learning\pima-indians-diabetes-database\diabetes.csv')
df.head()

This is what the data looks like


Define X and y, and create an instance of the decision tree classifier.

y = df['Outcome']
X = df.drop(['Outcome'], axis=1)
dec_cls = DecisionTreeClassifier()
dec_cls.fit(X, y)

It’s time to do K-fold validation by calling cross_val_score. Here cv=5 means the observations are split into 5 folds.

cross_val_score(DecisionTreeClassifier(), X, y, scoring="accuracy", cv=5)

The output shows five individual accuracy scores:

array([0.68181818, 0.68831169, 0.69480519, 0.77777778, 0.7254902 ])

Take the mean of these values to estimate the overall accuracy in a live application:

print('Estimated Accuracy {:.3f}'.format(cross_val_score(DecisionTreeClassifier(), X, y, scoring="accuracy", cv = 5).mean()))

Output:

Estimated Accuracy 0.716
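
Note that the one-liner above re-runs all five fits just to take the mean. A common variant is to call cross_val_score once, keep the returned array, and report the standard deviation alongside the mean as a measure of spread. A sketch (synthetic data stands in for the CSV loaded above):

```python
# Run cross-validation once, keep the per-fold scores, and report the
# mean together with the standard deviation across folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         scoring="accuracy", cv=5)
print('Estimated Accuracy {:.3f} (+/- {:.3f})'.format(scores.mean(),
                                                      scores.std()))
```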

There you have it: now you know what K-fold validation in machine learning is.
