
Machine Learning Cross (K-fold) Validation Introduction | by Sanjay Singh

source link: https://medium.com/sanrusha-consultancy/machine-learning-cross-k-fold-validation-introduction-de5a385de8ec

Machine Learning Cross (K-fold) Validation Introduction


Machine learning involves splitting the available observations into training and test sets, training the model on the training set, and evaluating it on the test set.

The figure below shows the usual steps involved in machine learning.

Fig1: Machine Learning observation split
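
The split described above can be sketched with scikit-learn's train_test_split. This is a minimal illustration on a synthetic dataset (make_classification stands in for real observations; all names here are illustrative):

```python
# A minimal sketch of the train/test workflow in Fig1, using a synthetic
# dataset in place of real observations.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hold out 20% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)            # train on the training set
print(model.score(X_test, y_test))     # evaluate on the held-out test set
```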

This approach works well if you have a lot of observations available to train and test your model, but it shows the following weaknesses when the observation size is small.

  1. The available observations have to be split, leaving less data to train the algorithm and derive an optimal model.
  2. Accuracy calculated on the training data is misleadingly optimistic, especially for algorithms like decision trees, which can fit the training set almost perfectly.
  3. The test set is small (usually about 20% of the observations), so the accuracy calculated on it might not match the accuracy on non-training data (e.g. live data), and over time the model might need re-training.
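
Point 2 can be seen directly: an unconstrained decision tree keeps splitting until it fits its training data (near) perfectly, so training accuracy tells you almost nothing. A quick sketch on synthetic data (an assumption, not the article's dataset):

```python
# Why training accuracy is uninformative for decision trees: with no depth
# limit, the tree typically fits the training set perfectly.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

tree = DecisionTreeClassifier(random_state=1)   # no depth limit
tree.fit(X, y)
print(tree.score(X, y))   # training accuracy, typically 1.0
```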

Cross validation (commonly performed as K-fold validation) addresses these challenges to an extent.

The goal of cross validation is to validate the model several times without sacrificing data available to train it. The training data is split into K sample sets (folds); in each cycle the model is trained on K−1 folds and validated independently on the remaining one. The accuracies from the K validation cycles are then averaged to estimate the model's accuracy in a live system.

Machine Learning Cross Validation

Because cross validation estimates the expected accuracy of the optimized model on non-training data, it provides a better estimate of how the model will perform in a live application.

The benefit is that it obtains this estimate from the training data alone; unlike the previous approach, you don't need to set aside a separate subset of data to calculate it.
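
The fold-by-fold procedure described above can be written out by hand with scikit-learn's KFold, which is essentially what cross_val_score automates. A sketch on synthetic data (the dataset and variable names are illustrative):

```python
# What K-fold validation does step by step: split the data into K folds,
# train on K-1 folds, validate on the held-out fold, and average the scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=250, n_features=8, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                # train on 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))   # validate on the 5th

print(np.mean(scores))   # estimate of accuracy on unseen data
```

Note that for classifiers, cross_val_score with an integer cv actually uses stratified folds (StratifiedKFold), which preserves class proportions in each fold; the plain KFold loop above is the simplest version of the idea.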

Implementation

Let’s implement K-fold validation using the scikit-learn function cross_val_score.

You can get the source data (the Pima Indians Diabetes dataset) from Kaggle.

First things first, let’s import all the libraries:

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

Import the data into Pandas Dataframe

df = pd.read_csv(r'C:\Sanjay\Machine Learning\pima-indians-diabetes-database\diabetes.csv')
df.head()

This is what the data looks like


Define X and y, and create an instance of the decision tree classifier.

y = df['Outcome']
X = df.drop(['Outcome'], axis=1)
dec_cls = DecisionTreeClassifier()
dec_cls.fit(X, y)

It’s time to do K-fold validation by calling cross_val_score. Here cv=5 means the observations are split into 5 folds.

cross_val_score(DecisionTreeClassifier(), X, y, scoring="accuracy", cv=5)

The output shows five individual accuracy scores:

array([0.68181818, 0.68831169, 0.69480519, 0.77777778, 0.7254902 ])

Take the mean of these values to estimate the overall accuracy in a live application:

print('Estimated Accuracy {:.3f}'.format(cross_val_score(DecisionTreeClassifier(), X, y, scoring="accuracy", cv = 5).mean()))

Output:

Estimated Accuracy 0.716
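
Note that the one-liner above re-runs all five fits just to take the mean. A common variant is to call cross_val_score once, keep the returned array, and report the standard deviation alongside the mean as a measure of spread. A sketch (synthetic data stands in for the CSV loaded above):

```python
# Run cross-validation once, keep the per-fold scores, and report the
# mean together with the standard deviation across folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         scoring="accuracy", cv=5)
print('Estimated Accuracy {:.3f} (+/- {:.3f})'.format(scores.mean(),
                                                      scores.std()))
```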

There you have it: now you know what K-fold validation in machine learning is.
