
Support Vector Machine (SVM). Machine Learning for Complete Beginners.

A comprehensive guide to start practicing Machine Learning (ML) in Python for complete beginners, with hands-on examples. Learn ML and upgrade yourself from a complete beginner to an ML practitioner. This article explains the SVM algorithm.

Photo by Kalineri on Unsplash

Intro

The Support Vector Machine algorithm (or SVM) is a classification algorithm that classifies cases by separating them from one another with a decision boundary.

The point of SVM is to separate the data into classes by finding a boundary/separator.

In SVM, data points on one side of that separator belong to one class, and those on the other side belong to another class.

One might wonder: if we want to classify unknown cases, why not use a classification algorithm that classifies the data points directly (such as K-Nearest Neighbors) instead of classifying them by separation?

There are several things to say:

On the one hand, domain-specific data sometimes cannot be classified with a particular algorithm, and other algorithms should be used instead. In the case of K-NN, there are several situations in which it might fail. For example:

  • When one or a few data points are far away from the other points (so the distance to those neighbors is large)
  • While K-NN classification works well in a low-dimensional feature space, it tends to fail in a high-dimensional one. This is called the curse of dimensionality: as we add dimensions, the space grows so quickly that the data become sparse and the distances between points stop being informative (see the short sketch after this list).
  • When the data look random, so we have to add other features for the classification. This unavoidably leads to the previous bullet point.
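To make the curse-of-dimensionality point more concrete, here is a small illustrative sketch of my own (not part of the original discussion): for uniformly random points, the relative contrast between the nearest and the farthest neighbor shrinks as the number of dimensions grows, which is exactly what hurts distance-based methods such as K-NN.

import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 1000):
    # 500 random points in the unit hypercube [0, 1]^dim
    points = rng.random((500, dim))
    # distances from the first point to all the others
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    # relative contrast between the farthest and the nearest neighbor
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim = {dim:4d}, relative contrast = {contrast:.3f}")

As the dimension grows, this contrast drops toward zero, so the "nearest" neighbors stop being meaningfully nearer than anything else.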

On the other hand, the SVM algorithm has the following advantages:

  • Effective in a high-dimensional feature space
  • Supports many different kernel functions
  • Works for both classification and regression

How does SVM work?

The algorithm works in the following way:

  • map data to a high-dimensional feature space
  • find an optimal separator (boundary)

A separator is not necessarily a line. It is a line only in a 2-D feature space; in 3-D space it is a plane, and in a higher-dimensional (N-D) space it is a hyperplane.

For a visualization, see the following image. Here, we see two different classes shown in different colors (left) and the same data transformed into a higher-dimensional space (right). Any new data point would be assigned to one of these classes based on its position relative to the hyperplane.

Challenges of the SVM algorithm

As we have seen, this algorithm can work with high-dimensional data, so in the world of Big Data with hundreds of features, it might be a good choice of classification algorithm, right?

However, we know that free cheese only comes in a mousetrap. There are a few challenges with this algorithm as well:

  1. How to transform data to a higher dimensional feature space in the first place?
  2. How to find an optimal separation hyperplane?

Without further ado, let us talk about each of these challenges one by one:

1. Transforming data to a high-D space

Before giving a formal definition of the process, I would like to walk through a simple example, after which you will have an intuitive feeling for what is going on under the hood:

Suppose we have a 1-D dataset (features X and labels Y), but the classes are not linearly separable.

Made by Author. Not linearly separable data.

So, we have to change/transform these data to make them linearly separable, such as shown here:

Made by Author. Transformation of the data to a higher-dimensional space, which is called kernelling. In this case, we have performed the square transformation x -> x*x.

Mathematically, this process is a mapping. Here, we have mapped the 1-D X space into the 2-D (X, X*X) space. Even visually, we can draw a separating line in the right panel of the figure above, which means the data are now linearly separable.

Any new/unknown case would be assigned to one of these classes depending on where it falls — below or above this separator.

In this case, we applied a square mapping F to perform this transformation, as follows:

F: X -> (X, X*X)

A similar process may be performed on higher-dimensional data as well (it is just a bit more challenging to draw by hand!). Now, let us talk about the conventional name for this process.
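To make this concrete, here is a tiny sketch of my own (a toy example, not the article's code) showing the same idea in NumPy: a 1-D dataset that no single threshold can separate becomes linearly separable once we add x*x as a second feature.

import numpy as np

# toy 1-D data: class 1 sits in the middle, class 0 on both sides,
# so no single threshold on x can separate the two classes
x = np.array([-3.0, -2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0, 3.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

# "kernelling" by hand: map x -> (x, x*x), i.e. add a second dimension
X_mapped = np.column_stack([x, x * x])

# in the new 2-D space, the horizontal line x*x = 1 separates the classes
print("class 1, second coordinate:", X_mapped[y == 1, 1])   # all below 1
print("class 0, second coordinate:", X_mapped[y == 0, 1])   # all above 1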

The formal definition of data transformation

Mapping data into a higher dimensional space is called kernelling.

A mathematical function that performs data transformation to a higher dimensional space is called a kernel function.

In our previous example, that mapping was the square mapping
F: X -> (X, X*X).

Kernel functions can be of different types, such as Linear, Polynomial, Sigmoid, and Radial Basis Function (RBF).

Before coming to a second challenge of the SVM algorithm implementation, I would like to mention a few more important points about these functions.

  1. These kernel functions have their own mathematical definitions (equations) and characteristics. You do not really have to know the equations to use them, because they are already implemented in Python libraries such as scikit-learn. But I would still strongly encourage you to at least take a look at their definitions, because in my opinion, understanding something is much more valuable than using a library/algorithm as a black box! (See the short sketch after this list.)
  2. There is no way to know in advance which function will perform better on a given dataset, so try them out and choose the one that gives the highest accuracy!
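For the curious, here is a small sketch of my own showing what two of these kernel functions actually compute for a pair of points (the linear kernel is a plain dot product; the RBF kernel is exp(-gamma * ||x - z||^2), the same formula scikit-learn uses):

import numpy as np

def linear_kernel(x, z):
    # linear kernel: just the dot product of the two points
    return np.dot(x, z)

def rbf_kernel(x, z, gamma=1.0):
    # Radial Basis Function kernel: exp(-gamma * squared distance)
    return np.exp(-gamma * np.sum((x - z) ** 2))

a = np.array([1.0, 2.0])
b = np.array([2.0, 0.0])
print("linear:", linear_kernel(a, b))   # 2.0
print("rbf:", rbf_kernel(a, b))         # exp(-5), roughly 0.0067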

2. Finding an optimal hyperplane

Alright, after transforming our data into a higher dimension, a natural question arises: How to choose an optimal hyperplane/separator?

A reasonable choice of hyperplane is one that separates the two classes with the largest margin (distance).

To see what I mean, let us take a look at the following illustration:

Made by Author. An illustration of the largest margin/separation between the two classes. The red arrow represents this margin, while the green line shows the most optimal hyperplane.

In other words, the bigger the red arrow, the better. So, how do we find the optimal hyperplane?

Origin of the SVM name

The data points closest to the separating hyperplane are called support vectors. Only the support vectors matter for defining this hyperplane, and the algorithm is called Support Vector Machine (SVM) to reflect this fact.

An optimal hyperplane has the maximal distance to support vectors.

The hyperplane is defined mathematically by an equation that looks very similar to the equation of a line, y = a * x + b. More generally, the hyperplane is defined as

wT * X + b = 0

where w and b are the parameters of the hyperplane (T means transposed). The equations of the borderlines (green dashed lines in the Figure above) are defined as:

wT * X + b = 1
wT * X + b = -1

The goal is to find the parameters w and b that minimize the following quantity:

P = 1/2 * wT * w, subject to yi * (wT * xi + b) >= 1 for every training point (xi, yi)

Minimizing 1/2 * wT * w makes the margin as wide as possible, while the constraint yi * (wT * xi + b) >= 1 makes sure that every training point lies on the correct side of the margin. The points for which this constraint holds with equality are exactly the support vectors sitting on the borderlines above.

Now, we are not going to randomly draw hyperplanes and hope that one of them turns out to be optimal! This is an optimization problem, so it can be solved numerically, for example with gradient descent*.

* Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function.
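To make this less abstract, here is a toy sketch of my own (not the article's code, and not how production libraries solve it) of sub-gradient descent on the soft-margin version of the objective above: 1/2 * lam * wT * w plus the average hinge loss max(0, 1 - yi * (wT * xi + b)). Libraries such as scikit-learn use much more efficient solvers under the hood.

import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=1000):
    # Toy sub-gradient descent on: 0.5 * lam * w.w + mean(max(0, 1 - y_i * (w.x_i + b)))
    # Labels y must be -1 or +1. Linear (no kernel) case only.
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)          # y_i * (w.x_i + b) for every sample
        mask = margins < 1                 # samples that violate the margin
        grad_w = lam * w
        grad_b = 0.0
        if mask.any():
            grad_w = grad_w - (y[mask][:, None] * X[mask]).sum(axis=0) / n
            grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# tiny 2-D example: two linearly separable blobs
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.5, 2.0],
              [6.0, 7.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])
w, b = train_linear_svm(X, y)
print("w =", w, "b =", b)
print("predicted classes:", np.sign(X @ w + b))   # compare with y above

This is only the idea in miniature; real SVM solvers are more sophisticated and also handle the kernel trick.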

Advantages vs. Disadvantages of the SVM

Before we go through a hands-on example of using the SVM algorithm, let us go through its advantages and disadvantages. As I mentioned earlier, there is no single algorithm to rule them all. Each has its pros and cons, and SVM is no exception:

Advantages of the SVM

  • Accurate in high-dimensional feature spaces (in comparison, for example, to the K-NN algorithm);
  • Memory efficient (because under the hood it uses only the support vectors).

Disadvantages of the SVM

  • Prone to overfitting (especially when the number of features/dimensions is much greater than the number of data points/samples);
  • Does not provide probability estimates (a hyperplane is just a boundary, with no percentage/probability attached to its predictions; see the note after this list);
  • Not efficient on large datasets (training time grows quickly with the number of samples).
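A quick practical note on the probability point (my own aside, not from the article): scikit-learn can bolt probability estimates onto an SVM by fitting an extra calibration step, but this is an add-on rather than something the raw hyperplane provides.

from sklearn import svm

# probability=True makes scikit-learn fit an additional calibration step,
# so the trained model also exposes predict_proba(...) alongside predict(...)
model_with_proba = svm.SVC(probability=True)
# after model_with_proba.fit(X_train, y_train), calling
# model_with_proba.predict_proba(X_test) returns class probabilities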

Here, one might ask: “Alright, Ruslan, I see the point of the SVM algorithm but it seems like there are more disadvantages than advantages of using it. What do you say about this?”. I would answer: “Dear friend, let us take a look at the applications of the SVM algorithm, i.e., when do we need to use it.”

Applications of the SVM

  • Text mining and text category assignment;
  • Gene expression classification;
  • Speech recognition;
  • Image analysis (classification and image recognition);
  • Spam detection;
  • Sentiment analysis* (or opinion mining);
  • Regression;
  • Clustering;
  • Outlier detection.

* Sentiment analysis is a natural language processing technique used to determine whether a piece of text is positive, neutral, or negative

As you can see, when we are dealing with high-dimensional data, we use SVM. I hope this answers your question.

Let us do a practical example, because

Practice makes progress.

Next, we will see how to use the SVM algorithm with the scikit-learn Python library.

Hands-on Example

Getting a dataset

I am going to show how to classify mushrooms as edible or poisonous. In principle, one should get one's own dataset from the real world. However, this is a tutorial, and we do not want to go out right now to collect mushrooms, right?

Suppose that we have done all the work: went out into the wild, collected the mushrooms, labeled them, and put them into a dataset. To make our life easier, let us simply go to the UCI Machine Learning Repository to get a dataset.

Let us pick, for the sake of this example, the dataset of different kinds of mushrooms.

Coding

Let us start with importing libraries

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

Numpy, pandas, and Scikit-learn are great when working with data. In short:

  • NumPy. If you are working with numerical arrays, mathematical functions, and data types (including image data), you need to know NumPy. (official page);
  • Pandas. This library has many functions for data importing, manipulation and analysis. If you are working with data sets, this library is a must. (official page);
  • Scikit-learn. This library collects simple and efficient tools for predictive data analysis and Machine Learning. (official page).

Next, we need to open our dataset.

# Loading the dataset
data = pd.read_csv('agaricus-lepiota.data')  # , header=None
# print(data.head())  # first 5 rows
# print(data.shape)

# Let us manually add the feature/column names
data.columns = ['class', 'cap-shape', 'cap-surface', 'cap-color',
                'bruises?', 'odor', 'gill-attachment',
                'gill-spacing', 'gill-size', 'gill-color',
                'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
                'stalk-surface-below-ring', 'stalk-color-above-ring',
                'stalk-color-below-ring', 'veil-type', 'veil-color',
                'ring-number', 'ring-type', 'spore-print-color',
                'population', 'habitat']
data.head()

As an output, we see the first 5 rows of this dataset:

Image by Author. Mushroom dataset as a Data Frame.

Now, this dataset is not very large, but normally you might be working with Big Data, maybe with hundreds of thousands of rows and columns. In such cases, it is better to reduce the size of the dataset for ease of work.

I have already analyzed this dataset with the K-NN algorithm earlier. Specifically, I selected the features for our sub-sample and transformed the strings (features and labels) into numbers to make life easier for our algorithm. So, I am going to put all the code in one block; for a detailed description of each step, please look here.

# selecting a sub-sample of features:
X = data[['cap-shape', 'cap-surface', 'cap-color',
          'odor', 'gill-color', 'veil-color', 'spore-print-color']].values
# print(X)

# defining our encoder and encoding each feature column as integers
Lbl_Enc_X = LabelEncoder()
for col in range(len(X[0])):
    X[:, col] = Lbl_Enc_X.fit_transform(X[:, col])
# print(X)

# extracting the labels from our dataset (.copy() avoids pandas' SettingWithCopyWarning)
y = data[['class']].copy()

# dictionary mapping the label letters to numbers
Lbl_Map = {
    'e': 0,
    'p': 1
}
# actual conversion
y['class_2'] = y['class'].map(Lbl_Map)
print(y)

As a result, we got two columns called class and class_2

Image by Author. Mushroom classes/labels — edible and poisonous.

For our modeling, we need only class_2, so let us use it.
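Before splitting the data, a quick optional sanity check (my own addition) on how balanced the two classes are:

# optional sanity check: how many edible (0) vs poisonous (1) mushrooms we have
print(y['class_2'].value_counts())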

Train-test split

We want our model to predict unknown (so-called out-of-sample) cases. For example, if we find some random mushroom in the forest, we want to input the parameters of this mushroom (size, shape, color, odor, etc.) into our trained model and get an output telling us whether we are going to have this mushroom for dinner tonight or not.

We will perform a nice trick, called train-test split.

How does it work? Train-test split divides our dataset into a training set and a testing set. We then use the training set to train our model and the testing set to test it.

In code, it looks like this

X_train, X_test, y_train, y_test = train_test_split(X, y['class_2'], test_size = 0.2)

Note: We chose our training set to be 80% of the data, while the testing set is the remaining 20%.
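If you want the split to be reproducible (and to keep the class proportions roughly equal in both subsets), you can optionally pass random_state and stratify; a small variation of the line above:

# optional: fixed random seed for reproducibility, stratified by the label
X_train, X_test, y_train, y_test = train_test_split(
    X, y['class_2'], test_size=0.2, random_state=42, stratify=y['class_2'])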

Creating our SVM model

With scikit-learn, it is pretty easy to create an SVM model. Let us import the model from this library and then initialize it as an object model_svm.

# importing the SVM model from the sklearn library
from sklearn import svm

# initializing the model
model_svm = svm.SVC()

Note: We did not pass any parameters to our model. For the moment, that is alright because we are interested in the different kernels, which we will check at the very end. The default kernel is 'rbf'.

Before we can make any predictions, we need to train this model:

model_svm.fit(X_train, y_train)
print(model_svm)
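As a small aside tying back to the origin of the name, the fitted scikit-learn model keeps the support vectors it found, so we can peek at how many data points actually define the boundary:

# number of support vectors per class (in the order of model_svm.classes_)
print(model_svm.n_support_)
# the support vectors themselves (rows of the training data)
print(model_svm.support_vectors_.shape)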

As we discussed earlier, we use our training set to train the model, while the test set is used for making predictions on out-of-sample data:

predictions = model_svm.predict(X_test)

Now, we have our predictions! Basically, at this point, we are done.

The only thing left is to see what we predicted and compare these predictions to the actual values, in other words, to get the accuracy of the model:

# accuracy calculation
accuracy = metrics.accuracy_score(y_test, predictions)

Here, the y_test contains the test labels (edible/poisonous) and the predictions variable contains predicted labels.
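Accuracy is a single number; if you want a slightly richer picture of how each class is predicted, scikit-learn also provides a confusion matrix and a per-class report. A minimal sketch using the same predictions:

from sklearn.metrics import classification_report, confusion_matrix

# counts of correct and incorrect predictions per class
print(confusion_matrix(y_test, predictions))
# per-class precision, recall, and F1 score
print(classification_report(y_test, predictions, target_names=['edible', 'poisonous']))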

Let us print the accuracy and a few of the labels (real and predicted):

print("Predictions = {}".format(predictions[10:15]))
print("Real labels = {}".format(y_test[10:15]))
print("Accuracy is {}".format(accuracy))

The output looks something like this:

Made by Author. The accuracy of our SVM model, based on a comparison of real and predicted labels.

As we see, the accuracy in this case is pretty good: around 96%!

Testing Different Kernel Functions

After running the help() function on the SVM model, we can see all the possible parameters we can pass.
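Alternatively (my own shortcut), you can list the default parameters programmatically:

# prints a dictionary of all SVC parameters and their current (default) values
print(svm.SVC().get_params())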

Here, we are interested in different kernel functions, namely:
linear, poly, rbf, and sigmoid.

So, let us repeat the procedure of the model initialization, fitting, and prediction for each of the kernels. We will use a simple for loop:

# Let us iterate over every kernel and compute the prediction accuracy for comparison
kernels = ('linear', 'poly', 'rbf', 'sigmoid')

for kernel in kernels:
    # create the model
    model_svm_kern = svm.SVC(kernel=kernel)
    # fit the data
    model_svm_kern.fit(X_train, y_train)
    # make predictions
    predictions_kern = model_svm_kern.predict(X_test)
    # compute accuracy
    accuracy_kern = metrics.accuracy_score(y_test, predictions_kern)

    # printing out the accuracy score for each kernel:
    print("\nKernel {}".format(kernel))
    print("Accuracy = {}\n".format(accuracy_kern))

As a result, we can see that our default rbf kernel performed way better than other functions.

Made by Author. Various kernel functions with their accuracy scores for this mushroom dataset. In this case, the RBF kernel gives the highest accuracy.

It is natural to go with rbf in this case. But other data sets might behave differently, so it is a good idea to check for these possibilities.
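If you want to automate this comparison (and tune other parameters, such as C, at the same time), scikit-learn's GridSearchCV runs the kernel-by-kernel search with cross-validation for you. A minimal sketch on the same training data:

from sklearn.model_selection import GridSearchCV

# try every kernel with a few C values, using 5-fold cross-validation
param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'C': [0.1, 1, 10]}
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)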

I intend to keep this tutorial short, simple, and easy to follow, so let us finish here.

The full code is available here.

Summary

The goal of this article was to give an overview of the Support Vector Machine (SVM) Machine Learning Classification algorithm with a practical example.

We covered the definitions, the under-the-hood workings of this algorithm, its challenges, as well as its advantages and disadvantages.

After that, we went through a practical example of using the SVM model to classify mushrooms as edible or poisonous and checked the different kernel functions.

We selected some features from the main dataset, performed a train-test split, and determined the accuracy of our predictions based on the test subset. As a result, we got around 96% accuracy.

In other words, if we find a new mushroom in the forest, determine its parameters/features (color, shape, odor, etc.), and make a prediction with our trained model, then about 96% of the time the mushroom will be correctly classified as edible or poisonous.

Want to learn more?

If you are interested in brushing up your knowledge of Python, here is a brief tutorial: from Hello World to Functions. As I mentioned earlier, here are a few of my projects explained from beginning to end. It can be a great opportunity to practice some Python skills and enhance your experience:

Thank you for reading until the end. I hope you have learned something new. If you would like to connect or have any questions or suggestions, I will be happy to hear from you.

Contact me:

Check out my GitHub.

Connect with me on LinkedIn.

References

  • Machine Learning for Complete Beginners. Introduction. [link]
  • Free mushroom data set. [link]
  • Scikit-learn documentation on SVM. [link]
