

Data Preprocessing using Python
source link: https://blog.usejournal.com/data-preprocessing-using-python-1bfee9268fb3
This article will take you through the basic concepts of Data Preprocessing and show how to implement them using Python. We’ll be starting from the basics, so if you have no prior knowledge of machine learning or data preprocessing, there is no need to worry!


Use the ipynb file available here to follow along with the implementation performed below. Everything, including the dataset, is present in the repository.
Let’s begin!
What is Data Preprocessing?
Data Preprocessing is the process of making data suitable for use in training a machine learning model. The dataset initially provided for training might not be in a ready-to-use state; for example, it might not be formatted properly, or it may contain missing or null values.
Solving all these problems using various methods is called Data Preprocessing. Using a properly processed dataset for training will not only make life easier for you but also increase the efficiency and accuracy of your model.
Steps in Data Preprocessing:
In this article, we’ll be covering the following steps:
- Importing the libraries
- Importing the dataset
- Taking care of missing data
- Encoding categorical data
- Normalizing the data
- Splitting the data into test and train

Step 1: Importing the libraries
In the beginning, we’ll import three basic libraries that are very common in machine learning and will be used every time you train a model:
- NumPy: a library that allows us to work with arrays; since most machine learning models operate on arrays, NumPy makes working with them easier
- Matplotlib: a library that helps in plotting graphs and charts, which is very useful for visualizing the results of your model
- Pandas: a library that lets us import our dataset and build the matrix of features containing the dependent and independent variables
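A minimal sketch of these imports (the aliases np, plt, and pd are the usual conventions):

```python
import numpy as np               # arrays and numerical operations
import matplotlib.pyplot as plt  # plotting graphs and charts
import pandas as pd              # importing and manipulating tabular data
```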
Step 2: Importing the dataset
The data that we’ll be using can be viewed and downloaded from here.

As you can see in the image above, we are using a very simple dataset that contains information about customers who have purchased a particular product from a company.
It contains information about the customers, such as their age, salary, and country. It also shows whether a particular customer has purchased the product or not.
It also contains some null values, which we’ll handle shortly.
Let’s begin by importing the data.
As the given data is in CSV format, we’ll be using the read_csv function from the pandas library.
Now we’ll display the imported data. Remember that data imported using the read_csv function is in DataFrame format; we’ll later convert it into NumPy arrays to perform other operations and training.
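A minimal sketch, assuming the file is saved as 'Data.csv' in the working directory (the filename is an assumption, as it isn't given in the text):

```python
# Load the CSV file into a pandas DataFrame ('Data.csv' is an assumed filename)
data = pd.read_csv('Data.csv')
print(data)
```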
In any dataset used for machine learning, there are two types of variables:
- Independent variable
- Dependent variable
The independent variables are the columns that we are going to use to predict the dependent variable; in other words, the independent variables affect the dependent variable.
In our dataset, the country, age, and salary columns are the independent variables and will be used to predict the purchased column, which is the dependent variable.
Step 3: Handling the missing values
As you can see, our dataset has two missing values: one in the Salary column of the 5th row and another in the Age column of the 7th row.

Now, there are multiple ways to handle missing values. One of them is to ignore them and delete the entire entry/row; this is commonly done in datasets containing a very large number of entries, where the missing values constitute only a tiny fraction (say, 0.1%) of the total data, affect the model negligibly, and can therefore be removed.
But in our case, the dataset is very small and we cannot just ignore those rows. So we use another method, in which we take the mean of the entire column containing the missing value (in our case the Age or Salary column) and replace the missing value with that mean.
To perform this process we will use the SimpleImputer class from the scikit-learn library.
Code for the Python implementation is given below:
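(A minimal sketch; the column names 'Age' and 'Salary' are assumptions based on the dataset description, and the original notebook may differ.)

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN entry with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Only the numerical columns are selected ('Age' and 'Salary' are assumed names)
data[['Age', 'Salary']] = imputer.fit_transform(data[['Age', 'Salary']])
```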
Here, missing_values = np.nan means that the values we are replacing are the missing (NaN) entries, and strategy = 'mean' means that we are replacing each missing value with the mean of its column.
You can see that we have selected only the columns with numerical data, as the mean can only be calculated on numerical data.
After running the above code you’ll get the following output:

As you can observe all the missing values have been replaced by the mean of the column.
Step 4: Encoding categorical data
In our case, we have two categorical columns: the country column and the purchased column.
- OneHot Encoding
In the country column, we have three different categories: France, Germany, and Spain. We could simply label France as 0, Germany as 1, and Spain as 2, but doing this might lead our machine learning model to interpret that there is some ordering or correlation between these numbers and the outcome.
So, to avoid this, we apply OneHot Encoding.
OneHot Encoding consists of turning the country column into three separate columns, each consisting of 0s and 1s. Each country thereby gets a unique vector/code, and no spurious correlation between the vectors and the outcome can be formed.
You’ll understand more about it when we implement it below:
To perform this encoding we use the OneHotEncoder and ColumnTransformer classes from the same scikit-learn library.
The ColumnTransformer class allows us to select the columns to apply the encoding to, while leaving the other columns untouched.
Note: The new columns created will be added at the front of the data frame and the original column will be deleted.
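(A minimal sketch; the column name 'Country' is an assumption based on the dataset description.)

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the Country column and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), ['Country'])],  # 'Country' is an assumed name
    remainder='passthrough')

# fit_transform returns an array with the encoded columns at the front;
# wrap it back into a DataFrame so we can keep working with it
data = pd.DataFrame(ct.fit_transform(data))
```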
After performing the above implementation you’ll get the following output:

Now we can see that each country has a unique vector or code; for example, France is 1 0 0, Spain is 0 0 1, and Germany is 0 1 0.
- Label Encoding
In the last column, i.e. the purchased column, the data is in binary form, meaning that there are only two outcomes: Yes or No. Here we therefore only need to perform Label Encoding.
In this case, we use the LabelEncoder class from the same scikit-learn library.
We use data.iloc[:, -1] to select the column we are transforming (the last column).
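(A minimal sketch of this step; LabelEncoder assigns the labels alphabetically, so No becomes 0 and Yes becomes 1.)

```python
from sklearn.preprocessing import LabelEncoder

# Encode the purchased column: 'No' -> 0, 'Yes' -> 1
le = LabelEncoder()
data.iloc[:, -1] = le.fit_transform(data.iloc[:, -1])
```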
After performing this our data will look something like this:

As you can see the purchased column has been successfully transformed.
Now we have completed the encoding of all the categorical data in our dataset and can move to the next step.
Step 5: Normalizing the dataset
Feature scaling means bringing all of the features in the dataset onto the same scale. This is necessary when training some machine learning models because, otherwise, the dominant features can become so dominant that the other, ordinary features are barely considered by the model.

When we normalize the dataset, the values of all the features are brought between 0 and 1, so that all the columns are in the same range and there is no dominant feature.
To normalize the dataset we use the MinMaxScaler class from the same scikit-learn library.
The implementation of MinMaxScaler is very simple:
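(A minimal sketch; following the text, the scaler is applied to the whole frame, though in practice you would usually scale only the numerical feature columns.)

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Rescale every column to [0, 1] via (x - min) / (max - min)
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data))
```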
After running the above code our data set will look something like this:

As you can see in the above image, all the values in the dataset are now between 0 and 1, so there are no dominant features and all features will be considered equally.
Note: Feature scaling is not always necessary and only required in some machine learning models.
Step 6: Splitting the dataset
Before we begin training our model there is one final step: splitting the dataset into training and testing sets. In machine learning, the larger part of the dataset is used to train the model, and a smaller part is used to test the trained model, so as to find out the accuracy and efficiency of the model.
Now, before we split the dataset, we need to separate the dependent and independent variables, which we have already discussed above.
The last (purchased) column is the dependent variable and the rest are independent variables, so we’ll store the dependent variable in ‘y’ and the independent variables in ‘X’.
Another important point to remember is that the model accepts data as arrays during training, so it is necessary that we convert the data to arrays. We do that while separating the dependent and independent variables, by adding .values when storing the data in ‘X’ and ‘y’.
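(A minimal sketch of this separation; .values converts the pandas selections into NumPy arrays.)

```python
# Independent variables: every column except the last
X = data.iloc[:, :-1].values
# Dependent variable: the last (purchased) column
y = data.iloc[:, -1].values
```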
After running the above code our data will look something like this:

Now let’s split the dataset between Testing data and Training data.
To do this we’ll be using the train_test_split function from the same scikit-learn library.
Deciding the ratio between testing data and training data is up to us and depends on what we are trying to achieve with our model. In our case, we are going to go with an 80-20 split between training and testing data: 80% training and 20% testing.
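(A minimal sketch; the random_state value is an arbitrary choice that just makes the split reproducible.)

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```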
Here, test_size = 0.2 signifies that we have selected 20% of the data as testing data; you can change that according to your needs.
After this, the X_train, X_test, y_train, and y_test variables will hold their respective data.

Now our data is finally ready for training!!