Pandas For Data Analysis - A Quick Guide - JournalDev
source link: https://www.journaldev.com/55404/pandas-data-analysis
Pandas For Data Analysis – A Quick Guide
pandas is an open-source Python library that is widely used for data analysis. It is robust and offers easy-to-use functions and go-to data structures for effective analysis. If you are an analyst or a data scientist, you know very well how invaluable pandas is.
Due to the wide range of functions, it is used in multiple domains such as finance, economics, business, and statistics. In this tutorial let’s see how pandas can be used for data analytics and how efficient it is in this process. Without wasting much time, let’s dive in!
Pandas for Data analysis
- Pandas offers robust functions for data manipulation and helps in reading and writing data into different file formats.
- Its labelled data structures make it flexible when working with large labelled or relational datasets.
- It supports high-performance operations such as aggregation, merging, concatenating, and reshaping.
- The pandas Series is the one-dimensional building block of the library: each column of a DataFrame is a Series.
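As a minimal sketch of how a Series relates to a DataFrame, the snippet below builds two Series and combines them into one frame. The names `price`, `area`, and `df` and all the values are made up for illustration; they are not taken from the Housing dataset.

```python
import pandas as pd

# Hypothetical values, used only to illustrate the data structures.
price = pd.Series([100, 150, 120], name="price")
area = pd.Series([50, 80, 65], name="area")

# A DataFrame can be assembled from several Series that share an index.
df = pd.concat([price, area], axis=1)

# Selecting a single column gives back a Series.
col = df["price"]
print(type(col).__name__)
print(df.shape)
```

Each Series keeps its name as the column label, which is why the resulting frame has the columns `price` and `area`.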
Things we do here –
- Load the data using read_csv().
- View the data.
- Get the dimensions of the data.
- Summary statistics of the data.
- Unique values and Crosstabs.
- Data types.
- Correlation among features.
Load the Data
For this tutorial, we will be working with a Housing dataset of 545 rows and 13 columns, which is small enough to explore quickly and serves the purpose well. Using pandas, we can load the data into Python.
#load the data
import pandas as pd
data = pd.read_csv('Housing.csv')
data.head(5)
We have successfully loaded the data into Python. Now let’s get to know the data and dive into the analysis.
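If you do not have Housing.csv at hand, the same loading step can be tried on an in-memory file. This is only a sketch: the three rows in `csv_text` are invented stand-ins, and `sample` is a hypothetical variable name.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for Housing.csv (values are made up).
csv_text = """price,area,bedrooms,furnishingstatus
13300000,7420,4,furnished
12250000,8960,4,semi-furnished
9100000,3000,,unfurnished
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```

The empty field in the last row is read as a missing value, which is the same kind of gap that later shows up in the non-null counts of info().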
Peek Into the Data
To get a high-level overview of the data, pandas offers several tools. We are going to use the head() and tail() functions to see the first and last n rows of the data. Similarly, we will use the shape attribute and the info() function to get the dimensions of the data and a summary of it.
Head and Tail
#head of the data
data.head(5)
#tail of the data
data.tail(5)
That’s good. The head and tail functions will return the top and bottom n rows of the data. You can always specify the number of rows which should be returned.
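The row count can be varied as in the sketch below. The ten-row frame `df` and its `row_id` column are hypothetical and exist only to make the selected rows easy to see.

```python
import pandas as pd

# Hypothetical ten-row frame, used only to show row selection.
df = pd.DataFrame({"row_id": range(10)})

first_three = df.head(3)   # first 3 rows
last_two = df.tail(2)      # last 2 rows
default_head = df.head()   # n defaults to 5 when omitted

print(first_three["row_id"].tolist())
print(last_two["row_id"].tolist())
```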
Shape
To know the dimensions of the data, we can use the shape attribute in pandas. Note that shape is an attribute, not a function, so it is written without parentheses.
#shape
data.shape
(545, 13)
That’s it. Our data has 545 rows and 13 columns. To see the names of those features (variables), use the columns attribute.
#features
data.columns
Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus'], dtype='object')
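Because shape is a plain tuple and columns is an Index, both can be used directly in ordinary Python code. A small sketch with a made-up two-column frame (`df`, `n_rows`, `n_cols`, and `feature_names` are illustrative names):

```python
import pandas as pd

# Hypothetical frame with invented values.
df = pd.DataFrame({"price": [100, 150, 120], "area": [50, 80, 65]})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

# columns is an Index; tolist() turns it into a plain Python list.
feature_names = df.columns.tolist()

print(n_rows, n_cols)
print(feature_names)
```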
That’s cool. Now we have all the feature names in the data. Finally, we have to understand what the data is telling us, so use the info() function and look at the results.
#info
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         532 non-null    float64
 4   stories           539 non-null    float64
 5   mainroad          545 non-null    object
 6   guestroom         537 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   518 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           538 non-null    float64
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: float64(3), int64(3), object(7)
memory usage: 55.5+ KB
Perfect! Here you get an idea of the null values and the data types as well. If you want to view only the data types, you can use the dtypes attribute.
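A quick way to look at data types and missing values on their own is sketched below. The frame `df` is hypothetical, with one gap placed in each of two columns; isna().sum() is one common way to count missing values per column and is an addition here, not something the original walkthrough uses.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in two of the columns.
df = pd.DataFrame({
    "price": [100, 150, 120],
    "bathrooms": [1.0, np.nan, 2.0],
    "mainroad": ["yes", "no", None],
})

# dtypes is an attribute that lists the type of every column.
print(df.dtypes)

# isna() marks missing cells; summing per column counts them.
missing = df.isna().sum()
print(missing)
```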
Statistical Analysis Using Pandas
Peeking into the data is not enough to understand it completely. You have to use some statistical measures to dig deeper into the data and get meaningful insights. Let’s do it together.
Here are some of the functions which we are going to use –
- describe()
- unique()
- sample()
- value_counts()
- corr()
Let’s see how we can use these functions and make sense out of our data.
Describe
The describe() function reports summary statistics such as the count, mean, standard deviation, min and max values, and quartiles.
#describe
data.describe()
By default, describe() only considers the numerical features.
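To include text columns in the summary as well, describe() accepts an include argument. The sketch below uses a small invented frame; `numeric_summary` and `full_summary` are illustrative names.

```python
import pandas as pd

# Hypothetical frame with one numeric and one text column.
df = pd.DataFrame({
    "price": [100, 150, 120, 130],
    "furnishingstatus": ["furnished", "unfurnished", "furnished", "semi-furnished"],
})

# Default behaviour: numeric columns only.
numeric_summary = df.describe()

# include="all" adds count, unique, top, and freq for text columns.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example, the mean of a text column) are shown as NaN.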
Unique
The unique() function returns all the distinct values in a column. Let’s try it out.
#unique
data['furnishingstatus'].unique()
array(['furnished', 'semi-furnished', 'unfurnished'], dtype=object)
It shows that the feature 'furnishingstatus' has 3 unique values.
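When only the number of distinct values matters, nunique() gives the count directly. A small sketch on a made-up Series (`status` is a hypothetical name, and the missing entry is added on purpose to show how the two calls differ):

```python
import pandas as pd

# Hypothetical column with one missing entry.
status = pd.Series(["furnished", "unfurnished", "furnished", "semi-furnished", None])

distinct = status.unique()      # array of distinct values, missing value included
n_distinct = status.nunique()   # count of distinct values, missing values excluded by default

print(distinct)
print(n_distinct)
```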
Sample
The sample() function returns randomly selected rows from the data.
#sampling
data.sample(5)
You can see the randomly sampled data values.
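Since sample() draws different rows on every call, it can help to fix the seed when results need to be repeatable. The sketch below assumes a hypothetical 100-row frame `df`; the seed values 42 and 0 are arbitrary.

```python
import pandas as pd

# Hypothetical 100-row frame, used only to demonstrate sampling.
df = pd.DataFrame({"row_id": range(100)})

# The same random_state gives the same rows on every run.
a = df.sample(5, random_state=42)
b = df.sample(5, random_state=42)

# frac samples a fraction of the rows instead of a fixed count.
tenth = df.sample(frac=0.1, random_state=0)

print(a.equals(b))
print(len(tenth))
```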
Value counts and Correlation
The value_counts() function returns the frequency of each value in a column, and the corr() function returns the pairwise correlation among the features.
#Value counts
data['furnishingstatus'].value_counts()
semi-furnished 227
unfurnished 178
furnished 140
Name: furnishingstatus, dtype: int64
This tells us that most of the houses are semi-furnished.
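Frequencies can also be expressed as proportions, and two categorical features can be cross-tabulated with pd.crosstab(), which covers the crosstab item from the list at the start. The frame below is invented for illustration; `shares` and `table` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame with two categorical columns.
df = pd.DataFrame({
    "furnishingstatus": ["furnished", "unfurnished", "furnished", "semi-furnished"],
    "airconditioning": ["yes", "no", "no", "yes"],
})

# normalize=True returns proportions instead of raw counts.
shares = df["furnishingstatus"].value_counts(normalize=True)

# crosstab counts how often each pair of categories occurs together.
table = pd.crosstab(df["furnishingstatus"], df["airconditioning"])

print(shares)
print(table)
```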
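#correlation (numeric columns only; text columns such as 'mainroad' cannot be correlated)
data.select_dtypes(include='number').corr()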
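Here is the correlation among the features. Each value ranges from -1 to +1: values near +1 indicate a strong positive relationship, values near -1 indicate a strong negative relationship, and values near 0 indicate little or no linear relationship.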
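To make the sign of the correlation concrete, here is a small sketch on invented data in which one column rises with price and another falls. Text columns are dropped first with select_dtypes(), because recent pandas versions raise an error when corr() meets non-numeric columns. All names and values are hypothetical.

```python
import pandas as pd

# Hypothetical frame: area rises with price, age falls as price rises.
df = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "area": [10, 20, 30, 40],
    "age": [40, 30, 20, 10],
    "mainroad": ["yes", "no", "yes", "no"],
})

# Keep numeric columns only, then compute pairwise Pearson correlation.
corr = df.select_dtypes(include="number").corr()

# Correlation of every other numeric feature with price, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Here `area` comes out at +1 and `age` at -1: both are perfectly related to price, only in opposite directions.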
Wrapping Up – Pandas
pandas is an open-source and robust Python library that is widely used for data manipulation and analysis. In this article, I have shown several pandas functions that help us with data analysis. I hope you find this useful, and don’t forget to grab some data and try it yourself.
That’s all for now. Happy Python!!!