
Pandas For Data Analysis - A Quick Guide - JournalDev

 2 years ago
source link: https://www.journaldev.com/55404/pandas-data-analysis

Pandas For Data Analysis – A Quick Guide

Filed Under: Pandas

pandas is an open-source Python library that is widely used for data analysis. It is robust and offers easy-to-use functions and go-to data structures for effective analysis. If you are an analyst or a data scientist, you already know how invaluable pandas is.

Because of its wide range of functions, it is used in many domains such as finance, economics, business, and statistics. In this tutorial, let’s see how pandas can be used for data analysis and how efficient it makes the process. Without wasting much time, let’s dive in!


Pandas for Data Analysis

  • Pandas offers robust functions for data manipulation and can read and write data in many file formats.
  • Its labelled data structures make it flexible when working with large labelled or relational datasets.
  • It supports high-performance operations such as aggregation, merging, concatenation and reshaping.
  • A pandas Series is a one-dimensional labelled array, and each column of a DataFrame is a Series.
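To make the relationship between a Series and a DataFrame concrete, here is a minimal sketch. The values are made up for illustration and the variable names are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
price = pd.Series([13300000, 12250000, 12250000], name="price")
area = pd.Series([7420, 8960, 9960], name="area")

# Each Series becomes one column; rows are aligned on the shared index
df = pd.concat([price, area], axis=1)

# Selecting a single column gives back a Series
col = df["area"]

print(type(col).__name__)
print(df.shape)
```

The Series names become the column labels, which is why naming them up front is convenient.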

Things we do here –

  • Load the data using read_csv().
  • View the data.
  • Get the dimensions of the data.
  • Summary statistics of the data.
  • Unique values and Crosstabs.
  • Data types.
  • Correlation among features.

Also read: How To Change Column Order Using Pandas.


Load the Data

For this tutorial, we will work with a Housing dataset of 545 rows and 13 columns, which is small but serves the purpose well. Using pandas, we can load the data into Python.

#load the data
import pandas as pd
data = pd.read_csv('Housing.csv')
data.head(5)
[Figure: Housing Data]

We have successfully loaded the data into Python. Now let’s get to know the data and dive into the analysis.
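If you do not have Housing.csv at hand, the loading step can be tried on an in-memory CSV. This is only a sketch: the rows below are made up, and csv_text, sample and subset are illustrative names that reuse a few column names from the housing data.

```python
import io
import pandas as pd

# Made-up rows that reuse a few column names from the housing data
csv_text = """price,area,bedrooms,mainroad
13300000,7420,4,yes
12250000,8960,4,yes
9100000,6000,,no
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# usecols restricts which columns are parsed
subset = pd.read_csv(io.StringIO(csv_text), usecols=["price", "area"])

print(sample.shape)
print(list(subset.columns))
```

The empty bedrooms field in the last row is read as a missing value, the same kind of gap that shows up later in the info() output.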


Peek Into the Data

To get a high-level overview of the data, pandas offers several tools. We will use head() and tail() to see the first and last n rows of the data. We will then use the shape attribute and the info() method to get the dimensions and a summary of the data.

head() and tail()

#head of the data
data.head(5)
[Figure: Housing Data 1]
#tail of the data
data.tail(5)
[Figure: Housing Tail]

That’s good. The head() and tail() functions return the first and last n rows of the data. If you do not pass n, it defaults to 5, and you can always specify how many rows should be returned.
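A small sketch of how the n argument behaves, using a made-up ten-row frame rather than the housing data:

```python
import pandas as pd

# Ten made-up rows, just to have something to slice
df = pd.DataFrame({"area": range(100, 110)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # n defaults to 5

print(first_three)
print(last_two)
print(len(default_head))
```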

Shape

To get the dimensions of the data, we can use the shape attribute of the DataFrame. Note that shape is an attribute, not a function, so it takes no parentheses.

#shape
data.shape
(545, 13)

That’s it. It says our data has 545 rows and 13 columns. Next, let’s see the names of those features (variables) using the columns attribute.

#features
data.columns
Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

That’s cool. Now we have all the feature names in the data. Finally, we need to understand what the data is telling us, so let’s call the info() function and look at the result.

#info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   price             545 non-null    int64  
 1   area              545 non-null    int64  
 2   bedrooms          545 non-null    int64  
 3   bathrooms         532 non-null    float64
 4   stories           539 non-null    float64
 5   mainroad          545 non-null    object 
 6   guestroom         537 non-null    object 
 7   basement          545 non-null    object 
 8   hotwaterheating   518 non-null    object 
 9   airconditioning   545 non-null    object 
 10  parking           538 non-null    float64
 11  prefarea          545 non-null    object 
 12  furnishingstatus  545 non-null    object 
dtypes: float64(3), int64(3), object(7)
memory usage: 55.5+ KB

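Perfect! This shows the non-null count and the data type of each column. The bathrooms, stories, guestroom, hotwaterheating and parking columns have fewer than 545 non-null values, so they contain missing entries. If you only want to view the data types, you can use the dtypes attribute.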
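The attributes discussed above can be combined into a quick overview. The sketch below uses a made-up three-row frame; isna().sum() is an addition not shown in the original walkthrough, included because it counts the missing values that info() only implies.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names borrowed from the housing data
df = pd.DataFrame({
    "price": [13300000, 12250000, 9100000],
    "bathrooms": [2.0, np.nan, 1.0],
    "guestroom": ["no", None, "yes"],
})

# shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = df.shape

# dtypes is also an attribute: one dtype per column
types = df.dtypes

# Missing values per column, complementing the non-null counts of info()
missing = df.isna().sum()

print(n_rows, n_cols)
print(types)
print(missing)
```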


Statistical Analysis Using Pandas

Peeking into the data is not enough to understand it completely. You have to use some statistical measures to dig deeper and get meaningful insights. Let’s do it together.

Here are some of the functions which we are going to use –

  • describe()
  • unique()
  • sample()
  • value_counts()
  • corr()

Let’s see how we can use these functions and make sense out of our data.

Describe

The describe() function reports summary statistics such as the count, mean, standard deviation, minimum, quartiles and maximum of each column.

#describe
data.describe()
[Figure: Describe Analysis Pandas]

By default, describe() considers only the numerical features.
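Text columns can be summarised too. The sketch below, on made-up data, shows two options: calling describe() on a single text column, and passing include="all" to cover every column at once.

```python
import pandas as pd

# Made-up data with one numeric and one text column
df = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "furnishingstatus": ["furnished", "unfurnished",
                         "furnished", "semi-furnished"],
})

# Default: numeric columns only
numeric_summary = df.describe()

# A text column reports count, unique, top (most frequent) and freq
text_summary = df["furnishingstatus"].describe()

# include="all" summarises every column; cells that do not apply are NaN
full_summary = df.describe(include="all")

print(numeric_summary)
print(text_summary)
print(full_summary)
```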

Unique

The unique() function returns all the distinct values in a column. Let’s try it out.

#unique
data['furnishingstatus'].unique()
array(['furnished', 'semi-furnished', 'unfurnished'], dtype=object)

It says that the feature ‘furnishingstatus’ has 3 unique values.
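If you only need the number of distinct values rather than the values themselves, nunique() is a shortcut. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up values mirroring the furnishingstatus categories
status = pd.Series(["furnished", "semi-furnished", "unfurnished", "furnished"])

# Distinct values, in order of first appearance
values = status.unique()

# Number of distinct values
count = status.nunique()

print(values)
print(count)
```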

Sample

The sample() function returns randomly selected records from the data.

#sampling
data.sample(5)
[Figure: Sample Analysis Pandas]

You can see the randomly sampled data values.
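Because the rows are drawn at random, each call returns a different result. The sketch below, on a made-up frame, shows how random_state makes the draw repeatable and how frac samples a fraction of the rows instead of a fixed count.

```python
import pandas as pd

# Twenty made-up rows
df = pd.DataFrame({"area": range(20)})

# The same random_state gives the same rows every time
a = df.sample(n=5, random_state=42)
b = df.sample(n=5, random_state=42)

# frac draws a share of the rows: 25% of 20 rows is 5 rows
quarter = df.sample(frac=0.25, random_state=0)

print(a.equals(b))
print(len(quarter))
```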

Value counts and Correlation

The value_counts() function gives the frequency of each value in a column, and the corr() function gives the pairwise correlation among the numerical features.

#Value counts
data['furnishingstatus'].value_counts()
semi-furnished    227
unfurnished       178
furnished         140
Name: furnishingstatus, dtype: int64

This tells us that most of the houses are semi-furnished.
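Counts are often easier to read as proportions, and the outline at the start also mentions crosstabs. The sketch below, on made-up data, shows value_counts(normalize=True) and pd.crosstab; neither call appears in the original walkthrough, so treat it as an illustration only.

```python
import pandas as pd

# Made-up rows; column names borrowed from the housing data
df = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished",
                         "semi-furnished", "unfurnished"],
    "airconditioning": ["yes", "yes", "no", "no"],
})

# Share of each category instead of raw counts
shares = df["furnishingstatus"].value_counts(normalize=True)

# Frequency table of one categorical column against another
table = pd.crosstab(df["furnishingstatus"], df["airconditioning"])

print(shares)
print(table)
```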

#correlation
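#numeric_only needs pandas 1.5+; from pandas 2.0, text columns raise an error without it
data.corr(numeric_only=True)
[Figure: Pandas Correlation]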

Here is the correlation among the features. The coefficient ranges from -1 to +1: values near +1 indicate a strong positive relationship, values near -1 a strong negative relationship, and values near 0 little or no linear relationship.
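To see both ends of the scale, here is a minimal sketch on made-up data in which one column rises with area and another falls. The age column is hypothetical and not part of the housing data; select_dtypes is used here as one way to keep only the numeric columns.

```python
import pandas as pd

# Made-up data: price rises with area, age falls as area rises
df = pd.DataFrame({
    "area": [1000, 2000, 3000, 4000],
    "price": [10, 20, 30, 40],
    "age": [40, 30, 20, 10],
    "mainroad": ["yes", "no", "yes", "no"],
})

# Keep only the numeric columns before computing correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr)
```

In this toy frame, area and price are perfectly positively correlated, while area and age are perfectly negatively correlated.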


Wrapping Up – Pandas

pandas is an open-source and robust Python library that is widely used for data manipulation and analysis. In this article, I have shown several pandas functions that help with data analysis. I hope you find this useful, and don’t forget to grab some data and try it yourself.

That’s all for now. Happy Python!!!

More read: Python Pandas

