Pandas For Data Analysis – A Quick Guide


Pandas is an open-source Python library that is widely used for data analysis. It is robust and offers easy-to-use functions and go-to data structures for effective analysis. If you are an analyst or a data scientist, you know very well how invaluable pandas is.

Due to the wide range of functions, it is used in multiple domains such as finance, economics, business, and statistics. In this tutorial let’s see how pandas can be used for data analytics and how efficient it is in this process. Without wasting much time, let’s dive in!

Pandas for Data Analysis

Pandas offers robust functions for data manipulation and helps in reading and writing data in different file formats.
Its labelled data structures make it flexible to work with large labelled or relational datasets.
It supports high-performance operations such as aggregation, merging, concatenating and reshaping.
The pandas Series is the one-dimensional labelled structure from which data frames are built: each column of a DataFrame is a Series.
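To make the relationship between a Series and a DataFrame concrete, here is a minimal sketch. The values and the variable names (price, area, houses) are made up for illustration and are not taken from the Housing dataset.

```python
import pandas as pd

# Hypothetical values, used only to illustrate the two structures.
price = pd.Series([13300000, 12250000, 12250000], name="price")
area = pd.Series([7420, 8960, 9960], name="area")

# A DataFrame can be assembled from Series that share an index.
houses = pd.concat([price, area], axis=1)

# Selecting a single column gives back a Series.
print(type(houses["price"]).__name__)
print(houses.shape)
```

Each Series keeps its name as the column label, which is why the resulting frame has the columns price and area.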

Things we do here –

Load the data using read_csv().
View the data.
Get the dimensions of the data.
Summary statistics of the data.
Unique values and Crosstabs.
Data types.
Correlation among features.


Load the Data

For this tutorial, we will be working on a Housing dataset with 545 rows and 13 columns, which serves the purpose well. Using pandas we can load the data into Python.

#load the data

import pandas as pd

data = pd.read_csv('Housing.csv')

data.head(5)

We have successfully loaded the data into Python. Now let’s get to know the data and dive into the analysis.
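If you do not have Housing.csv at hand, you can still try read_csv() on an in-memory file. The sketch below assumes a tiny, made-up extract (csv_text) with a few of the same column names. It also shows the usecols parameter, which loads only the columns you ask for.

```python
import io
import pandas as pd

# A tiny, made-up extract standing in for Housing.csv (values are illustrative).
csv_text = """price,area,bedrooms,furnishingstatus
13300000,7420,4,furnished
12250000,8960,4,furnished
12250000,9960,3,semi-furnished
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

# usecols restricts the load to the listed columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["price", "area"])

print(sample.shape)
print(list(subset.columns))
```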

Peek Into the Data

To get a high-level overview of the data, pandas offers multiple functions. We are going to use the head() and tail() functions to see the first and last n rows of the data. Similarly, we will use the shape attribute and the info() function to get the dimensions of the data and information about it.

Head and Tail

#head of the data

data.head(5)

#tail of the data

data.tail(5)

That’s good. The head and tail functions return the top and bottom n rows of the data (5 by default). You can always specify the number of rows to be returned.
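Here is a small sketch of how the n argument behaves. The frame df with 10 made-up rows is an assumption for illustration only.

```python
import pandas as pd

# 10 made-up rows, just to have something to slice.
df = pd.DataFrame({"price": range(100, 110)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_rows = df.head()   # n defaults to 5

print(len(first_three), len(last_two), len(default_rows))
```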

Shape

To know the dimensions of the data, we can make use of the shape attribute in pandas. Note that it is written without parentheses.

#shape

data.shape

(545, 13)

That’s it. It says our data has 545 rows and 13 columns. Next, to see the names of those features / variables, use the columns attribute.

#features

data.columns

Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')
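Since shape is a plain tuple and columns is an Index, both are easy to use in your own code. A minimal sketch, assuming a small made-up frame df:

```python
import pandas as pd

# Made-up frame with 3 rows and 2 columns.
df = pd.DataFrame({"price": [1, 2, 3], "area": [10, 20, 30]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# tolist() turns the column Index into a regular Python list.
column_names = df.columns.tolist()

print(n_rows, n_cols, column_names)
```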

That’s cool. Now we have all the feature names in the data. Finally, we have to understand what the data is telling us. So, use the info() function and get the results.

#info

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             545 non-null    int64  
 1   area              545 non-null    int64  
 2   bedrooms          545 non-null    int64  
 3   bathrooms         532 non-null    float64
 4   stories           539 non-null    float64
 5   mainroad          545 non-null    object 
 6   guestroom         537 non-null    object 
 7   basement          545 non-null    object 
 8   hotwaterheating   518 non-null    object 
 9   airconditioning   545 non-null    object 
 10  parking           538 non-null    float64
 11  prefarea          545 non-null    object 
 12  furnishingstatus  545 non-null    object 
dtypes: float64(3), int64(3), object(7)
memory usage: 55.5+ KB

Perfect! Here you get an idea about the null values and the data types as well. For example, bathrooms has 532 non-null values out of 545 rows, so 13 values are missing. If you want to view only the data types, you can make use of the dtypes attribute.
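The same information can be pulled out directly. The sketch below uses a made-up frame df with a couple of missing values. It reads the dtypes attribute and counts the nulls per column with isna().sum().

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in two of the columns.
df = pd.DataFrame({
    "price": [13300000, 12250000, 12250000],
    "bathrooms": [2.0, np.nan, 2.0],
    "mainroad": ["yes", "yes", None],
})

# dtypes is an attribute: one data type per column.
print(df.dtypes)

# isna() marks missing cells; sum() counts them column by column.
missing = df.isna().sum()
print(missing)
```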

Statistical Analysis Using Pandas

Peeking into the data is not enough to understand it completely. You have to use some statistical measures to dig deep into the data and get meaningful insights. Let’s do it together.

Here are some of the functions which we are going to use –

Describe
Unique
Sample
Value_counts
Correlation

Let’s see how we can use these functions and make sense out of our data.

Describe

The describe function helps us find statistical measures such as min and max values, mean, standard deviation and more.

#describe

data.describe()

By default, describe() considers only the numerical features.
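If you also want a summary of the text columns, describe() accepts an include parameter. A minimal sketch on a made-up frame df (the values are illustrative):

```python
import pandas as pd

# Made-up frame with one numeric and one text column.
df = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "furnishingstatus": ["furnished", "unfurnished", "furnished", "semi-furnished"],
})

# Default: numeric columns only.
numeric_summary = df.describe()

# include="all" adds count, unique, top and freq for the text columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) show up as NaN.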

Unique

The unique function helps us find all the unique values in a column. Let’s try it out.

#unique

data['furnishingstatus'].unique()

array(['furnished', 'semi-furnished', 'unfurnished'], dtype=object)

It says that the feature ‘furnishingstatus’ has 3 unique values.
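When you only need the number of distinct values, nunique() saves you from counting the array yourself. A short sketch on a made-up Series named status:

```python
import pandas as pd

# Made-up column values for illustration.
status = pd.Series(
    ["furnished", "semi-furnished", "furnished", "unfurnished"],
    name="furnishingstatus",
)

# unique() returns the distinct values in order of first appearance.
values = status.unique()

# nunique() returns how many distinct values there are.
count = status.nunique()

print(list(values), count)
```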

Sample

The sample function is used to get random records from the data.

#sampling

data.sample(5)

You can see the randomly sampled data values.
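Because the rows are picked at random, every run returns a different result. If you need the same sample each time, pass random_state. The sketch below uses a made-up frame df with 20 rows. The seed values 42 and 0 are arbitrary choices.

```python
import pandas as pd

# 20 made-up rows.
df = pd.DataFrame({"price": range(20)})

# The same random_state gives the same rows on every call.
first_draw = df.sample(5, random_state=42)
second_draw = df.sample(5, random_state=42)

# frac draws a fraction of the rows instead of a fixed number.
tenth = df.sample(frac=0.1, random_state=0)

print(first_draw.equals(second_draw), len(tenth))
```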

Value counts and Correlation

The value counts and correlation functions help us get the frequency of the values and the correlation among the features respectively.

#Value counts

data['furnishingstatus'].value_counts()

semi-furnished    227
unfurnished       178
furnished         140
Name: furnishingstatus, dtype: int64

This tells us that semi-furnished is the most common furnishing status (227 of the 545 houses).
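Two closely related tools are worth a quick look: value_counts(normalize=True) gives shares instead of counts, and pd.crosstab() counts the combinations of two columns. This is a sketch on a made-up frame df, not output from the Housing data.

```python
import pandas as pd

# Made-up frame with two categorical columns.
df = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished", "semi-furnished", "unfurnished"],
    "mainroad": ["yes", "yes", "no", "yes"],
})

# normalize=True returns proportions that sum to 1.
shares = df["furnishingstatus"].value_counts(normalize=True)

# crosstab counts how often each pair of values occurs together.
table = pd.crosstab(df["furnishingstatus"], df["mainroad"])

print(shares)
print(table)
```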

#correlation (numeric_only=True skips the text columns; needs pandas 1.5 or later)

data.corr(numeric_only=True)

Here is the correlation among the features. Each value ranges from +1 to -1, where +1 means a perfect positive correlation, -1 means a perfect negative correlation, and values near 0 mean little or no linear correlation.
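Correlation is only defined for numeric columns, and recent pandas versions (2.0 and later) raise an error if text columns are passed to corr(). One way around this is to select the numeric columns first. The sketch below assumes a made-up frame df where price rises with area and a hypothetical discount column falls as area rises.

```python
import pandas as pd

# Made-up values chosen so the relationships are exactly linear.
df = pd.DataFrame({
    "area": [1000, 2000, 3000, 4000],
    "price": [10, 20, 30, 40],             # rises with area
    "discount": [8, 6, 4, 2],              # falls as area rises
    "mainroad": ["yes", "no", "yes", "no"],  # text column, not usable for corr
})

# Keep only the numeric columns, then compute pairwise Pearson correlation.
corr = df.select_dtypes(include="number").corr()

print(corr)
```

Here area and price correlate at about +1, while area and discount correlate at about -1, which is a strong relationship in the opposite direction rather than a weak one.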

Wrapping Up – Pandas

Python pandas is an open-source and robust library that is widely used for data manipulation and analysis. In this article, I have shown many pandas functions which help us in data analysis. I hope you find this useful, and don’t forget to grab some data and try it yourself.

That’s all for now. Happy Python!!!

