
12 Amazing Pandas & NumPy Functions

Source: https://towardsdatascience.com/12-amazing-pandas-numpy-functions-22e5671a45b8?gi=8c9a4e3326cd

We all know that Pandas and NumPy are amazing, and they play a crucial role in our day-to-day analysis. Without Pandas and NumPy, we would be left deserted in this huge world of data analytics and science. Today, I am going to share 12 amazing Pandas and NumPy functions that will make your life and analysis much easier. At the end, you can find a Jupyter Notebook with the code used in this article.

Let’s start with NumPy:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array object
  • sophisticated (broadcasting) functions
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
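
As a quick illustration of those arbitrary data-types, here is a minimal sketch of a structured array (the field names and values here are made up for the example):

import numpy as np

# A structured dtype with a string field and a float field (hypothetical fields)
person = np.dtype([('name', 'U10'), ('score', 'f8')])
records = np.array([('Alice', 91.5), ('Bob', 78.0)], dtype=person)
records['score']
array([91.5, 78. ])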

1. argpartition()

NumPy has this amazing function that can find the indices of the N largest values. argpartition() partially sorts the array so that the last N positions hold the indices of the N largest elements, and we can then sort those values if needed.

import numpy as np

x = np.array([12, 10, 12, 0, 6, 8, 9, 1, 16, 4, 6, 0])

# Indices of the 4 largest values
index_val = np.argpartition(x, -4)[-4:]
index_val
array([1, 8, 2, 0], dtype=int64)

np.sort(x[index_val])
array([10, 12, 12, 16])
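
The same trick works for the N smallest values; a quick sketch on the same x:

# Indices of the 4 smallest values, then sort those values
np.sort(x[np.argpartition(x, 4)[:4]])
array([0, 0, 1, 4])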

2. allclose()

Allclose() is used to check whether two arrays are element-wise equal within a tolerance, returning a single boolean value. It returns False if any pair of items in the two arrays differs by more than the tolerance. It is a great way to check if two arrays are similar, which can actually be fiddly to implement manually.

array1 = np.array([0.12, 0.17, 0.24, 0.29])
array2 = np.array([0.13, 0.19, 0.26, 0.31])

# With a relative tolerance of 0.1, it returns False:
np.allclose(array1, array2, 0.1)
False

# With a relative tolerance of 0.2, it returns True:
np.allclose(array1, array2, 0.2)
True
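
Note that the third positional argument above is rtol, the relative tolerance. To compare with a plain absolute tolerance instead, it is safer to pass the keywords explicitly; a minimal sketch:

# Explicit keywords avoid the positional rtol/atol confusion;
# all element-wise gaps here are at most 0.02, so 0.05 passes
np.allclose(array1, array2, rtol=0.0, atol=0.05)
True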

3. clip()

Clip() is used to keep the values in an array within an interval. Sometimes we need to bound values between an upper and a lower limit, and for that purpose we can make use of NumPy’s clip(). Given an interval, values outside the interval are clipped to the interval edges.

x = np.array([3, 17, 14, 23, 2, 2, 6, 8, 1, 2, 16, 0])

# Clip everything to the interval [2, 5]
np.clip(x, 2, 5)
array([3, 5, 5, 5, 2, 2, 5, 5, 2, 2, 5, 2])
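
clip() can also bound just one side by passing None for the other limit; a quick sketch on the same x:

# Cap only the upper end at 10; leave the lower end untouched
np.clip(x, None, 10)
array([ 3, 10, 10, 10,  2,  2,  6,  8,  1,  2, 10,  0])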

4. extract()

Extract(), as the name suggests, is used to extract specific elements from an array based on a condition. With extract(), we can also combine conditions using the element-wise operators & (and) and | (or).

# Random integers
array = np.random.randint(20, size=12)
array
array([ 0,  1,  8, 19, 16, 18, 10, 11,  2, 13, 14,  3])

# Divide by 2 and check if the remainder is 1
cond = np.mod(array, 2) == 1
cond
array([False,  True, False,  True, False, False, False,  True, False,  True, False,  True])

# Use extract to get the values
np.extract(cond, array)
array([ 1, 19, 11, 13,  3])

# Apply the condition inside extract directly
np.extract(((array < 3) | (array > 15)), array)
array([ 0,  1, 19, 16, 18,  2])
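
For a 1-D array, np.extract(cond, array) is equivalent to plain boolean-mask indexing, which many people find more readable:

# Same odd-values result via boolean indexing
array[np.mod(array, 2) == 1]
array([ 1, 19, 11, 13,  3])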

5. where()

Where() is used to return elements from an array that satisfy a certain condition. Called with just a condition, it returns the index positions of the values that satisfy it; called with two extra arguments, it chooses between them element-wise. This is loosely similar to the WHERE clause we use in SQL, as the examples below demonstrate.

y = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

# Where y is greater than 5, returns index positions
np.where(y > 5)
(array([2, 3, 5, 7, 8], dtype=int64),)

# The first value replaces the elements that match the condition,
# the second replaces those that do not
np.where(y > 5, "Hit", "Miss")
array(['Miss', 'Miss', 'Hit', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit'], dtype='<U4')
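
Since np.where(condition) returns a tuple of index arrays, you can feed the result straight back into the array to pull out the matching values:

# The values themselves, not just their positions
y[np.where(y > 5)]
array([6, 8, 7, 6, 9])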

6. percentile()

Percentile() is used to compute the nth percentile of the array elements along the specified axis.

a = np.array([1, 5, 6, 8, 1, 7, 3, 6, 9])

print("50th Percentile of a, axis = 0 : ",
      np.percentile(a, 50, axis=0))
50th Percentile of a, axis = 0 :  6.0

b = np.array([[10, 7, 4], [3, 2, 1]])

print("30th Percentile of b, axis = 0 : ",
      np.percentile(b, 30, axis=0))
30th Percentile of b, axis = 0 :  [5.1 3.5 1.9]
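
percentile() also works along other axes; a quick sketch computing row-wise percentiles of the same b:

print("30th Percentile of b, axis = 1 : ",
      np.percentile(b, 30, axis=1))
30th Percentile of b, axis = 1 :  [5.8 1.6]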

Let me know if you’ve used these before and how they helped you. Let’s move on to the amazing Pandas.

Pandas:

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with structured (tabular, multidimensional, potentially heterogeneous) and time-series data both easy and intuitive.

pandas is well suited for many different kinds of data:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time-series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational/statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure.

Here are just a few of the things that pandas does well:

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.

1. read_csv(nrows=n)

You might already be aware of the read_csv function. But most of us still make the mistake of reading an entire .csv file even when it is not required. Consider a situation where we are unaware of the columns and the data present in a 10 GB .csv file; reading the whole file would not be a smart decision, because it would waste memory and take a lot of time. We can instead import just a few rows from the .csv file and then proceed further as per our need.

import io
import requests
import pandas as pd

# I am using this online data set just to make things easier for you guys
url = "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/AirPassengers.csv"
s = requests.get(url).content

# Read only the first 10 rows
df = pd.read_csv(io.StringIO(s.decode('utf-8')), nrows=10, index_col=0)
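
If you will eventually need the whole file, the chunksize parameter of read_csv lets you stream it in pieces instead of holding it all in memory at once; a minimal sketch:

# Process the CSV in chunks of 100 rows at a time
for chunk in pd.read_csv(io.StringIO(s.decode('utf-8')), chunksize=100):
    print(chunk.shape)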

2. map()

The map() function is used to map the values of a Series according to an input correspondence. It substitutes each value in a Series with another value that may be derived from a function, a dict, or a Series.

# Create a dataframe
dframe = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                      index=['India', 'USA', 'China', 'Russia'])

# Compute a formatted string from each floating-point value in the frame
changefn = lambda x: '%.2f' % x

# Make changes element-wise
dframe['d'].map(changefn)
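
map() also accepts a plain dict, which is handy for simple value substitution; a short sketch (the series and mapping here are made up for the example):

# Substitute values via a dict; unmatched values become NaN
codes = pd.Series(['a', 'b', 'c'])
codes.map({'a': 'alpha', 'b': 'beta'})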

3. apply()

The apply() function allows the user to pass a function and apply it along an axis of a DataFrame; by default the function receives each column as a Series (and Series.apply does the same for every single value of a Series).

# Max minus min lambda fn
fn = lambda x: x.max() - x.min()

# Apply this column-wise on the dframe that we've just created above
dframe.apply(fn)
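
By default apply() hands each column to the function; pass axis=1 to apply it to each row instead:

# Max minus min computed per row rather than per column
dframe.apply(fn, axis=1)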

4. isin()

The isin() function is used to filter data frames. It helps in selecting rows that have a particular value (or one of multiple values) in a particular column. It is the most useful function I’ve come across.

# Using the dataframe we created for read_csv
filter1 = df["value"].isin([112])
filter2 = df["time"].isin([1949.000000])

df[filter1 & filter2]
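
Negating the mask with ~ gives the complementary filter, i.e. all rows that do not contain the value:

# All rows whose value is NOT 112
df[~df["value"].isin([112])]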

5. copy()

The copy() function is used to create a copy of a Pandas object. When you assign a data frame to another variable, you only get a reference to the same object, so changes made through one name show up in the other. To prevent that issue, we can make use of copy().

# Creating a sample series
data = pd.Series(['India', 'Pakistan', 'China', 'Mongolia'])

# The assignment issue we face: both names point to the same object
data1 = data

# Change a value
data1[0] = 'USA'

# The change also shows up in the old series
data

# To prevent that, we create a copy of the series
new = data.copy()

# Assigning new values
new[1] = 'Changed value'

# Printing data
print(new)
print(data)
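
Note that copy() is deep by default; copy(deep=False) creates a new object that still shares the underlying data, so the original issue would persist:

# A shallow copy still shares data with the original series
shallow = data.copy(deep=False)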

6. select_dtypes()

The select_dtypes() function returns a subset of the data frame's columns based on the column dtypes. Its parameters can be set to include all columns having a specific data type, or to exclude all columns that have specific data types.

# We'll use the same dataframe that we used for read_csv;
# this returns only the time column
framex = df.select_dtypes(include="float64")
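
The exclude parameter works the same way in reverse; a quick sketch on the same dataframe:

# Everything except float columns; here that leaves the int64 value column
df.select_dtypes(exclude="float64")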

Bonus:

pivot_table()

The most amazing and useful function in pandas is pivot_table. If you hesitate to use groupby and want to extend its functionality, you can very well use pivot_table. If you’re aware of how a pivot table works in Excel, then it might be a piece of cake for you. Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.

# Create a sample dataframe
school = pd.DataFrame({'A': ['Jay', 'Usher', 'Nicky', 'Romero', 'Will'],
                       'B': ['Masters', 'Graduate', 'Graduate', 'Masters', 'Graduate'],
                       'C': [26, 22, 20, 23, 24]})

# Let's create a pivot table to segregate students based on age and course
table = pd.pivot_table(school, values='A', index=['B', 'C'],
                       columns=['B'], aggfunc=np.sum, fill_value="Not Available")

table
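
As a point of comparison, a similar aggregation can be written with groupby, which pivot_table generalizes; a rough sketch:

# A comparable groupby aggregation over course (B) and age (C)
school.groupby(['B', 'C'])['A'].sum()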

Do let me know down below in the comments if you have come across or used any other amazing functions. I would love to know more about them.

Jupyter Notebook (Code used) : https://github.com/kunaldhariwal/Medium-12-Amazing-Pandas-NumPy-Functions

LinkedIn : https://bit.ly/2u4YPoF

I hope that this has helped you to enhance your knowledge base :)

Follow me for more!

Thanks for reading and for your valuable time!

