
Data processing using ML Supervised classification algorithm to find accuracy

Reading Time: 5 minutes

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.

Types of machine learning

  1. Supervised learning
  2. Unsupervised Learning
  3. Reinforcement Learning

In supervised learning, algorithms learn from labeled data. After learning the patterns in that data, the algorithm decides which label should be given to new, unlabeled data by matching it against those patterns. Supervised learning is divided into two types: classification and regression.

Classification
It predicts a discrete set of values. In classification, the data is categorized under different labels according to some parameters, and those labels are then predicted for new data. A common special case with only two labels, such as survived or not, is called binary classification.

Regression
Regression is one of the simplest supervised learning approaches. It helps us capture relationships between the input variables, which we call features, and the output variables, which are the predictions we would like to make. To motivate this and understand what those concepts mean, imagine we wanted to understand the relationship between height and weight, or, equivalently, to come up with a predictor: can we predict a person's weight if we know their height?

It predicts a continuous-valued output. Regression analysis is a statistical model used to predict numeric data instead of labels. It can also identify distribution trends based on available or historical data. Predicting a person's income from their age and education is an example of a regression task, as is predicting the price of a home from the features of the home.
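To make regression concrete, here is a minimal sketch with made-up height/weight numbers (purely illustrative, not from the dataset used below) that fits a linear model to predict weight from height:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: heights in cm (feature) and weights in kg (continuous target).
heights = np.array([[150], [160], [170], [180], [190]])
weights = np.array([50, 58, 66, 74, 82])

model = LinearRegression()
model.fit(heights, weights)

# Predict the weight of a person who is 175 cm tall.
print(model.predict([[175]]))  # a continuous value, here 70.0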

Here in this blog we are going to focus on classification. Some of the classification algorithms that we are going to use are:

  1. Logistic Regression
  2. Support Vector Machine
  3. K-Nearest Neighbors
  4. Decision Tree Classifier
  5. Random Forest
  6. Gaussian Naive Bayes, and so on

The Python libraries that we are going to use are pandas, NumPy, and scikit-learn.

Data set: https://drive.google.com/open?id=1JHxdlmzDGQ5OvNGZSXcui1csd6nU2zK2

import pandas as pd
import numpy as np
 
# Import the data.
data = pd.read_csv('train.csv')

# Get a peek at the data.
data.head(5)

Performing Data Cleaning and Analysis

Understand the meaning of each column.

  • Passenger ID – Unique ID given to each passenger in the dataset.
  • Survived – Whether the passenger survived (1) or died (0).
  • Name – Passenger's name.
  • Sex – Passenger's sex.
  • Age – Passenger's age.
  • SibSp – Number of siblings/spouses aboard.
  • Parch – Number of parents/children aboard. (Some children traveled with a nanny, therefore Parch = 0 for them.)
  • Ticket – Ticket number.
  • Fare – Fare for the ticket.
  • Cabin – The passenger's cabin number, if they opted for a cabin.
  • Embarked – Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

Note – Don’t just delete the columns because you’re not finding it useful. Our focus is not on deleting the columns but on analyzing how each column is affecting the result of the prediction and in accordance with that, deciding whether to keep the column or to delete the column or fill the null values of the column by some values and if yes, then what values.


# Data in the 'Name' column can never decide the survival of a person. Hence, we can safely delete it.
# Similarly for the 'Ticket', 'Fare' and 'PassengerId' columns.

del data['Name']
del data['Ticket']
del data['Fare']
del data['PassengerId']
Managing null values in the dataset
# Getting the count of NULL values in the data. 
data.isnull().sum()
Survived      0
Pclass        0
Name          0
Age         177
SibSp         0
Parch         0
Cabin       687
Embarked      2

We see that most of the rows in the Cabin column are NULL. This implies that the column is filled in only when a passenger availed the cabin facility on the ship, i.e., his/her cabin number is mentioned. So we will fill the NULL values with 0 and everything else with 1; this means 1 for a person having a cabin and 0 for not.


# Checking the unique values in the 'Cabin' column.
data['Cabin'].unique()

# Since a NULL value in the 'Cabin' column represents no cabin for the passenger, we will put 0 for them.
data['Cabin'] = data['Cabin'].fillna(0)

# Filling 1 for the rest of the passengers, who had a cabin.
for index in data.index:
    if data['Cabin'][index] != 0:
        data.loc[index, 'Cabin'] = 1  # .loc avoids pandas' chained-assignment warning

# One-hot encoding the 'Sex' column to convert its string values to integers.
sex_encoded = pd.get_dummies(data['Sex'])

# Appending the encoded values back to the original data (1 for female, 0 for male).
data['Sex'] = sex_encoded['female']
data.head()

There are significant changes in the survival rate based on the port at which passengers boarded the ship, so we cannot delete the whole Embarked column (it is useful). The column does have some null values, but only two out of all the rows, so whatever we do with them will barely affect the result. Rather than dropping those rows, we simply fill them with the most frequent port.

# Checking which value occurs most often in this column; we will fill the missing 'Embarked' values with it.
data['Embarked'].value_counts()

# Filling the NULL values in 'Embarked' column.
data['Embarked'] = data['Embarked'].fillna('S')

# Encoding the character values in the 'Embarked' column to numerical values.
embarked_encoded = pd.get_dummies(data['Embarked'])

# Appending the encoded values of the 'Embarked' column back to the dataset.
# (Queenstown is implied when both 'Cherbourg' and 'Southampton' are 0.)
data['Cherbourg'] = embarked_encoded['C']
data['Southampton'] = embarked_encoded['S']

# Deleting unnecessary 'Embarked' column.
del data['Embarked']
data.head()

# Filling the null values in 'Age' column with the Mean of all the values present.
value = data['Age'].mean()
data['Age'] = data['Age'].fillna(value)
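The mean is sensitive to outliers; if the Age distribution were heavily skewed, filling with the median would be a common alternative (a one-line variation, not what this post uses):

# Alternative: fill missing ages with the median instead of the mean.
data['Age'] = data['Age'].fillna(data['Age'].median())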

Applying Machine Learning Algorithms to the Processed Data

from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Split out a validation dataset.
# X is the feature matrix (Pclass, Sex, Age, SibSp, Parch, Cabin, Cherbourg, Southampton)
# and Y is the label (Survived, the first column).
array = data.values
X = array[:, 1:9]
Y = array[:, 0]
Y = Y.astype('int')

# Split arrays or matrices into random train and test subsets using train_test_split.

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X,Y,test_size=validation_size, random_state=seed)

The idea behind StandardScaler is that it transforms your data so that each feature's distribution has a mean of 0 and a standard deviation of 1: each value has the column mean subtracted from it and is then divided by the column's standard deviation, i.e., z = (x - mean) / std.

from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()

X_train = sc_X.fit_transform(X_train) # Fit to the training data, then transform it.

# Apply the same scaling (fitted on the training data) to the validation set.
X_validation = sc_X.transform(X_validation)

scoring = 'accuracy'

The K-Folds cross-validator provides train/test indices to split the data into train/test sets. It splits the dataset into k consecutive folds (without shuffling by default); each fold is then used once as validation while the k - 1 remaining folds form the training set. cross_val_score evaluates a score by cross-validation.

models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10)  # without shuffling, a random_state would have no effect
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: Accuracy :%f " % (name, cv_results.mean()*100)
    print(msg)

Output : 

LR: Accuracy :80.340376 
LDA: Accuracy :80.624022 
KNN: Accuracy :80.491002 
CART: Accuracy :79.370110 
NB: Accuracy :77.114632 
SVM: Accuracy :83.018388
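The held-out validation set from earlier has not been used yet. As a follow-up, here is a minimal sketch (assuming the split, scaler, and imports defined above) that checks the best cross-validated model, SVM, against that unseen data:

# Train the best-scoring model on the full training split
# and evaluate it on the held-out validation set.
svm = SVC()
svm.fit(X_train, Y_train)
predictions = svm.predict(X_validation)

print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))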

References

1) https://scikit-learn.org/stable/

2) https://pandas.pydata.org/

3) https://www.numpy.org/


