
20 Pandas Functions That Will Boost Your Data Analysis Process


Explained with examples.

Photo by Heng Films on Unsplash

Pandas is one of the most widely used Python data analysis libraries. It provides many functions and methods to expedite the data analysis process. What makes pandas so popular is its functionality, flexibility, and simple syntax.

In this post, I will explain 20 pandas functions with examples. Some of them are so common that I’m sure you have used them before, and some might be new to you. But all of them will add value to your data analysis process.

Let’s start.

import numpy as np
import pandas as pd

1. Query

We sometimes need to filter a dataframe based on a condition or apply a mask to get certain values. One easy way to filter a dataframe is the query function. Let’s first create a sample dataframe.

values_1 = np.random.randint(10, size=10)
values_2 = np.random.randint(10, size=10)
years = np.arange(2010,2020)
groups = ['A','A','B','A','B','B','C','A','C','C']

df = pd.DataFrame({'group':groups, 'year':years,
                   'value_1':values_1, 'value_2':values_2})
df

The query function is very simple to use: it only requires writing the filtering condition.

df.query('value_1 < value_2')

2. Insert

When we add a new column to a dataframe, it is appended at the end by default. However, pandas offers the option to add the new column at any position using the insert function.

We need to specify the position by passing an index as the first argument. This value must be an integer. Column indices start from zero, just like row indices. The second argument is the column name, and the third argument is the object that holds the values, which can be a Series or an array-like object.

# new column
new_col = np.random.randn(10)

# insert the new column at position 2
df.insert(2, 'new_col', new_col)
df

3. Cumsum

The dataframe contains some yearly values for 3 different groups. We may only be interested in the yearly values, but there are cases in which we also need a cumulative sum. Pandas provides an easy-to-use function to calculate it: cumsum.

If we apply cumsum directly, the groups (A, B, C) will be ignored. Such cumulative values may be useless in some cases because we are not able to distinguish between groups. Don’t worry: there is a very simple and convenient solution for this issue. We can apply groupby and then the cumsum function.

df['cumsum_2'] = df[['value_2','group']].groupby('group').cumsum()
df

4. Sample

The sample method allows you to select values randomly from a Series or DataFrame. It is useful when we want to select a random sample from a distribution.

sample1 = df.sample(n=3)
sample1

We specify the number of values with the n parameter, but we can also pass a ratio to the frac parameter. For instance, 0.5 will return half of the rows.

sample2 = df.sample(frac=0.5)
sample2

To obtain reproducible samples, we can use random_state parameter. If an integer value is passed to random_state, the same sample will be produced every time the code is run.
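For instance, with an arbitrary seed value:

sample3 = df.sample(n=3, random_state=42)
sample3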

5. Where

“Where” is used to replace values in rows or columns based on a condition. The default replacement value is NaN, but we can also specify the value to be put as a replacement.

df['new_col'].where(df['new_col'] > 0, 0)

The way “where” works is that values that fit the condition are kept as is and the remaining values are replaced with the specified value. where(df['new_col'] > 0, 0) keeps the values in “new_col” that are greater than 0, and the remaining values are replaced with 0. Thus, where can also be considered a mask operation.

One important point is that “where” for pandas and NumPy is not exactly the same. We can achieve the same result but with slightly different syntax. With DataFrame.where, the values that fit the condition are kept as is and the other values are replaced with the specified value. np.where also requires specifying the value for the entries that fit the condition. The following two lines return the same result:

df['new_col'].where(df['new_col'] > 0, 0)
np.where(df['new_col'] > 0, df['new_col'], 0)

6. Isin

We use filtering or selection methods a lot when working with dataframes. The isin method is a kind of advanced filtering: for example, we can filter values based on a list of selections.

# the year column holds integers, so we filter with integer values
years = [2010, 2014, 2017]
df[df.year.isin(years)]

7. Loc and iloc

Loc and iloc are used to select rows and columns.

  • loc: select by labels
  • iloc: select by positions

loc is used to select data by label. The labels of columns are the column names. We need to be careful about row labels. If we do not assign any specific index, pandas creates an integer index by default. Thus, the row labels are integers starting from 0 and going up. The row positions used with iloc are also integers starting from 0.

Selecting first 3 rows and first 2 columns with iloc:
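df.iloc[:3, :2]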

Selecting first 3 rows and first 2 columns with loc:
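df.loc[:2, ['group','year']]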

Note: Upper boundaries of indices are included when loc is used whereas they are excluded with iloc.

Selecting rows “1”, “3”, “5” and columns “year” and “value_1”:
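df.loc[[1,3,5], ['year','value_1']]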

8. Pct_change

This function is used to calculate the percent change through the values in a series. Consider a series that contains [2, 3, 6]. If we apply pct_change to it, the returned series will be [NaN, 0.5, 1.0]: there is a 50% increase from the first element to the second, and a 100% increase from the second to the third. The pct_change function is useful for comparing the percentage of change across a time series.

df.value_1.pct_change()

9. Rank

The rank function assigns ranks to values. Assume we have a series that contains [1, 7, 5, 3]. The ranks assigned to these values will be [1, 4, 3, 2].

df['rank_1'] = df['value_1'].rank()
df

10. Melt

Melt is used to convert wide dataframes to narrow ones. By wide, I mean a dataframe with a high number of columns. Some dataframes are structured in a way that consecutive measurements or variables are represented as columns. In some cases, representing these columns as rows may fit our task better. Consider the following dataframe:
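Here is a minimal sketch of such a wide dataframe (the city names and measurement values are made up for illustration):

df_wide = pd.DataFrame({'city': ['A','B','C'],
                        'day1': [22, 25, 21],
                        'day2': [10, 14, 25],
                        'day3': [25, 22, 28]})
df_wide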

We have three different cities and measurements done on different days. We decide to represent these days as rows in a single column, with another column holding the measurements. We can easily accomplish this by using the melt function:

df_wide.melt(id_vars=['city'])

The variable and value column names are given by default. We can use the var_name and value_name parameters of the melt function to assign new column names:
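For instance, with hypothetical names 'day' and 'measurement':

df_wide.melt(id_vars=['city'], var_name='day', value_name='measurement')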

11. Explode

Assume your data set includes multiple entries of a feature on a single observation (row) but you want to analyze them on separate rows.
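Here is a minimal sketch of such a dataframe (the values are made up; the measurement of ‘c’ is stored as a list):

df1 = pd.DataFrame({'ID': ['a','b','c'],
                    'measurement': [4, 6, [2, 3, 8]],
                    'day': [1, 1, 1]})
df1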

We want to see the measurements of ‘c’ on day ‘1’ on separate rows, which can easily be done using explode:

df1.explode('measurement').reset_index(drop=True)

12. Nunique

Nunique counts the number of unique entries over columns or rows. It is very useful for categorical features, especially in cases where we do not know the number of categories beforehand. Let’s take a look at our initial dataframe:

df.year.nunique()
10

df.group.nunique()
3

We can directly apply the nunique function to the dataframe and see the number of unique values in each column:
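df.nunique()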

If the axis parameter is set to 1, nunique returns the number of unique values in each row:
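df.nunique(axis=1)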

13. Lookup

The lookup function can be used to look up values in a DataFrame based on pairs of row labels and column names. It is best explained via an example. Assume we have the following DataFrame:
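Here is a minimal sketch of such a dataframe (the names and values are made up; note that it overwrites our earlier df):

df = pd.DataFrame({'day': [1, 2, 3, 4],
                   'Alex': [4, 2, 7, 8],
                   'Bill': [3, 5, 1, 6],
                   'Cory': [9, 0, 3, 4],
                   'Dave': [1, 6, 2, 5],
                   'Person': ['Alex','Cory','Bill','Dave']})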

For each day, we have the measurements of 4 people and a column that includes the names of these 4 people. We want to create a new column that shows the measurement of the person in the “Person” column. Thus, for the first row, the value in the new column will be 4 (the value in the “Alex” column).

df['Person_point'] = df.lookup(df.index, df['Person'])
df
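Note that lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, an equivalent result can be obtained with the factorize-based recipe suggested in the pandas documentation:

# equivalent of the removed lookup method
idx, cols = pd.factorize(df['Person'])
df['Person_point'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]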

14. Infer_objects

Pandas supports a wide range of data types, one of which is object. Object covers text and mixed (numeric and non-numeric) values. However, the object data type is not preferred if a different option is available, because certain operations are executed faster with more specific data types. For example, we prefer integer or float data types for numerical values.

infer_objects attempts to infer better data types for object columns. Consider the following dataframe:
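Here is a minimal sketch of such a dataframe (the values are made up), with everything stored as object:

df2 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [1.0, 2.0, 3.0],
                    'C': [True, False, True],
                    'D': ['a', 'b', 'c']}, dtype='object')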

df2.dtypes
A    object 
B    object 
C    object 
D    object 
dtype: object

All of the data types are object. Let’s see what the inferred data types are:

df2.infer_objects().dtypes
A      int64 
B    float64 
C       bool 
D     object 
dtype: object

It may seem trivial, but it will definitely be useful when there are lots of columns.

15. Memory_usage

memory_usage() returns how much memory each column uses in bytes. It is especially useful when we work with large dataframes. Consider the following dataframe with 1 million rows.

df_large = pd.DataFrame({'A': np.random.randn(1000000),
                         'B': np.random.randint(100, size=1000000)})

df_large.shape
(1000000, 2)

And the memory usage for each column in bytes:

df_large.memory_usage()
Index        128 
A        8000000 
B        8000000 
dtype: int64

Memory usage of entire dataframe in megabytes:

df_large.memory_usage().sum() / (1024**2) # converting to megabytes
15.2589111328125

16. Describe

The describe function calculates basic statistics for numeric columns: count, mean, standard deviation, minimum and maximum values, median, and the first and third quartiles. Thus, it provides a statistical summary of the dataframe.
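For the dataframe created at the beginning of the post:

df.describe()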

17. Merge

merge() combines DataFrames based on values in shared columns. Consider the following two dataframes:
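Here is a minimal sketch of two such dataframes (the values are made up) that share column_a:

df1 = pd.DataFrame({'column_a': ['a','b','c','d'],
                    'column_b': [1, 2, 3, 4]})
df2 = pd.DataFrame({'column_a': ['a','b','e'],
                    'column_c': [10, 20, 30]})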

We can merge them based on the shared values in a column. The parameter that sets the condition for merging is the on parameter:
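df1.merge(df2, on='column_a')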

df1 and df2 are merged based on the common values in column_a. The how parameter of the merge function allows combining dataframes in different ways. The possible values for how are ‘inner’, ‘outer’, ‘left’, and ‘right’:

  • inner: only the rows with matching values in the column specified by the on parameter (the default)
  • outer: all the rows
  • left: all rows from left DataFrame
  • right: all rows from right DataFrame
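For example, to keep all the rows from df1:

df1.merge(df2, on='column_a', how='left')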

18. Select_dtypes

The select_dtypes function returns a subset of the DataFrame’s columns based on conditions on the data types. It allows including or excluding certain data types using the include and exclude parameters.

df.select_dtypes(include='int64')
df.select_dtypes(exclude='int64')

19. Replace

As the name suggests, it allows replacing values in a dataframe.

df.replace('A', 'A_1')

The first parameter is the value to be replaced and the second one is the new value.

We can also pass in a dictionary for multiple replacements at the same time.

df.replace({'A':'A_1', 'B':'B_1'})

20. Applymap

The applymap function is used to apply a function to a dataframe elementwise. Please note that if a vectorized version of an operation is available, it should be preferred over applymap. For instance, if we want to multiply each element by a number, we do not need, and should not use, applymap. A simple vectorized operation (e.g. df * 4) is much faster in that case.

However, there might be some cases where we do not have the option of a vectorized operation. For instance, we can change the style of a dataframe using the Style property of pandas dataframes. The following function changes the color of negative values to red.

def color_negative_values(val):
    # return a CSS color string: red for negative values, black otherwise
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color

We need to use applymap to apply this function through the dataframe’s style property:
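Here df3 can be any numeric dataframe; a minimal sketch with some negative values (the numbers are made up):

df3 = pd.DataFrame({'A': [-1.2, 0.5, 3.1],
                    'B': [0.4, -2.3, 1.8]})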

df3.style.applymap(color_negative_values)
