
Data Preparation Using Pandas In Python


Many times, we think analysis is all about getting the data, loading it into a preferred platform, and using some tools to plot graphs and compute metrics that feed decision making. But that is rarely the whole story.


The data we get is usually messy, with plenty of inconsistencies in it. This is where data preparation comes in. In this article, we will discuss some of the key data preparation methods using Python pandas.


1. Load the Data

For this entire discussion, we will be using the Titanic dataset because of how widely known it is. You can find plenty of datasets on Kaggle; don't forget to check it out.

More read: 4 Easy Ways For Data Filtering In Python Pandas

#load the data
import pandas as pd
data = pd.read_csv('Titanic.csv')

Our data is now ready to be prepared. You can peek into the data using the function data.head().
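For example:

#peek at the first five rows (pass a number for more, e.g. data.head(10))
data.head()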


2. Explore the Basics

Before we move into preparation, we have to explore the data: its shape, attributes, statistical summary, datatypes, and more.

Shape

#Shape of the data
data.shape
(891, 12)

Attributes

#attributes
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Summary

#summary - statistical
data.describe()
[Image: output of data.describe() – count, mean, std, min, quartiles, and max for each numeric column]

Datatypes

#datatypes
data.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

As you can observe, we are exploring the data with very simple one-liner functions. You can also define a custom function of your own in Python.
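A minimal sketch of such a helper (the explore function below is our own, not a pandas built-in):

#one-stop exploration helper (illustrative, not a pandas built-in)
def explore(df):
    print('Shape:', df.shape)
    print('Columns:', list(df.columns))
    print('Datatypes:\n', df.dtypes)
    print('Missing values:\n', df.isnull().sum())

explore(data)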

In the next steps, we will be focusing on preparing the data for the final analysis.


3. Drop Irrelevant Attributes

Not all the attributes in the data will be useful for analysis, so we need to drop the irrelevant ones before diving in.

In our data, the attributes such as PassengerId, Ticket, and Name can be dropped straight away. Let’s do it.

#define the columns
columns = ['PassengerId','Name','Ticket']
#drop them
data.drop(columns, inplace = True, axis = 1)

You can cross-check that the columns have been dropped using the attribute data.columns.

#recheck
data.columns
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

Perfect! We have dropped the irrelevant attributes now.


4. Duplicate Check

A record entered into the data more than once is a duplicate entry. We often encounter this issue with datasets, so check for duplicates in the very initial phase of the analysis so that you do not need to worry about them later.

#duplicate
data.drop_duplicates(inplace = True)

You can confirm the same using the shape of the data.

#recheck
data.shape
(784, 9)

The shape dropped from 891 rows to 784, so 107 duplicate rows were found and removed. It is good to deal with this early.
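If you want a direct count of the duplicates, pandas can flag repeated rows (a small sketch; run it before drop_duplicates, as it returns 0 afterwards):

#count fully duplicated rows (0 once drop_duplicates has run)
data.duplicated().sum()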


5. Missing Values

The missing values or the NaN’s can cause many issues while working with different modules and algorithms. So, you need to either replace them or fill them with a relevant value such as the mean of the values.

First, we have to check for the presence of missing values.

#missing values
data.isnull().sum()
Survived      0
Pclass        0
Sex           0
Age         106
SibSp         0
Parch         0
Fare          0
Cabin       581
Embarked      2
dtype: int64

Oh! Unfortunately, we have many missing values in our data. Let’s see how we can tackle them.

Replace

Missing values can be denoted in multiple ways in raw data. Sometimes you will see a blank space; they can also appear as 'NA' or as a dash.

So, to get uniformity over the missing values, we replace all of these markers with proper NaN values, which pandas recognizes as missing. Later, we can fill the NaNs with the most relevant value.

#replace the string marker 'NA' with a real NaN (a no-op here, since
#pandas already parsed the blank Age entries as NaN on load)
import numpy as np
data['Age'].replace('NA', np.nan)
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

You can replace such markers over the entire data as well. For that, you can use this line of code – data.replace('NA', np.nan).
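As a sketch, assuming the raw file marks missing entries with strings such as 'NA' or a dash, you could standardize all of them to NaN in one pass:

#standardize assumed missing-value markers ('NA', '-') across the whole frame
import numpy as np
data = data.replace(['NA', '-'], np.nan)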

Fill NA’s

To fill the missing values, we will use the mean of the values. First, find the mean of the values in the attribute in question; then fill it in wherever that attribute is missing.

#find mean
round(data['Age'].mean(),2)
29.7

We have got the mean of the values. Now, let’s fill this value in place of missing values in the ‘Age’ attribute.

#fill NA's (assigning back avoids chained-assignment warnings in newer pandas)
data['Age'] = data['Age'].fillna(29.7)

That’s perfect! You have no missing values in the ‘Age’ attribute now. You can do it for the rest of the data using the same method.


6. Pivoting The Data

By pivoting the data, we can uncover a lot of hidden insights in the raw data. Let's see how we can make use of the pivot function in pandas to get this done.

#pivoting (assign to a new name so the original DataFrame is preserved)
pivot = pd.pivot_table(data, values='Survived', index='Sex', columns='Pclass')
pivot

Basically, here we wanted to uncover survivability based on sex and passenger class. Let's see what the pivot table offers us.

[Image: pivot table of mean survival rate for each Sex/Pclass combination]

You are awesome!

As we can see from the pivoted table above, females in Pclass 1 and 2 were the luckiest, as the survivability in those classes approaches 100%.
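Under the hood, pivot_table averages the 0/1 'Survived' flag per group by default, which is exactly the survival rate. A sketch that makes the aggregation explicit and adds overall totals:

#explicit aggregation plus row/column totals in the 'All' margin
pd.pivot_table(data, values='Survived', index='Sex', columns='Pclass',
               aggfunc='mean', margins=True)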

All the above methods and functions will help you a lot in preparing your data on your analysis journey. I hope you find them easy and handy.


Conclusion – Data Preparation

Data preparation is an essential phase of any data analysis project. If you get your data right, it will unleash many of the hidden mysteries in it. There are many lost stories in that data; make sure you hear them right. On this journey, I hope the above methods and functions help you make the process easy and effective.

That’s all for now. Happy Python!!!

More read: Python pandas for data wrangling.

