A Practical Guide on Missing Values with Pandas

 4 years ago
source link: https://towardsdatascience.com/a-practical-guide-on-missing-values-with-pandas-8fb3e0b46c24?gi=bcc4ba6a908b

Photo by Zach Lucero on Unsplash

Missing values indicate that we do not have information about a feature (column) of a particular observation (row). Why not just remove that observation from the dataset and move on? We can, but we should not. The reasons are:

  • We typically have many features of an observation so we don’t want to lose the observation just because of one missing feature. Data is valuable.
  • We typically have more than one observation with missing values. In some cases, we cannot afford to remove many observations from the dataset. Again, data is valuable.

In this post, we will go through how to detect and handle missing values as well as some key points to keep in mind.

The outline of the post:

  • Missing value markers
  • Detecting missing values
  • Calculations with missing values
  • Handling missing values

As always, we start with importing numpy and pandas.

import numpy as np
import pandas as pd

Missing value markers

The default missing value representation in Pandas is NaN, but Python’s None is also detected as a missing value.

s = pd.Series([1, 3, 4, np.nan, None, 8])
s

Although we created the series from integers, the values are upcast to float because np.nan is a float. Pandas 1.0 introduced a new missing value representation, <NA> (pd.NA), which can be used with integers without causing upcasting. We need to explicitly request the nullable integer dtype, pd.Int64Dtype().

s = pd.Series([1, 3, 4, np.nan, None, 8], dtype=pd.Int64Dtype())
s

The integer values are not upcast to float.
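As a quick check of the difference between the two series, the sketch below builds both from the same list and compares their dtypes and missing value markers. The variable names (values, s_float, s_int) are only illustrative.

```python
import numpy as np
import pandas as pd

values = [1, 3, 4, np.nan, None, 8]

# Default behavior: the missing entries force an upcast to float64
s_float = pd.Series(values)

# Nullable integer dtype: the integers stay integers, missing entries become <NA>
s_int = pd.Series(values, dtype=pd.Int64Dtype())

print(s_float.dtype)        # float64
print(s_int.dtype)          # Int64
print(s_int[3] is pd.NA)    # True: np.nan was stored as <NA>
print(s_int[4] is pd.NA)    # True: None was stored as <NA>
print(s_int.isna().sum())   # 2
```

The string alias dtype="Int64" (with a capital I) requests the same nullable integer dtype.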

Another missing value representation is NaT, which marks missing values in datetime64[ns] data.
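A minimal sketch of how NaT shows up in practice, assuming we build a datetime series from a list that contains a None. The dates are made up for illustration.

```python
import pandas as pd

# The None entry is converted to NaT when the values are parsed as datetimes
dates = pd.Series(pd.to_datetime(["2020-01-01", None, "2020-01-03"]))

print(dates)
print(dates.iloc[1] is pd.NaT)   # True
print(dates.isna().tolist())     # [False, True, False]
```

NaT is detected by isna() in the same way as NaN and None.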

Note: np.nan values do not compare equal to each other, whereas None values are considered equal.
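The comparison behavior can be checked directly. The sketch below also shows one practical consequence: testing a series with == np.nan finds nothing, so isna() is the reliable way to locate missing values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan_equal = np.nan == np.nan     # False: NaN never equals NaN
none_equal = None == None        # True
print(nan_equal, none_equal)

s = pd.Series([1.0, np.nan, 3.0])

# Comparing against np.nan matches nothing, even the NaN entry
mask_eq = s == np.nan
print(mask_eq.tolist())          # [False, False, False]

# isna() correctly flags the missing entry
mask_isna = s.isna()
print(mask_isna.tolist())        # [False, True, False]
```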


Note: Not all missing values come in a clean np.nan or None format. For example, the dataset we work on may include “?” and “- -” values in some cells. We can convert them to np.nan when reading the dataset into a pandas dataframe by passing these values to the na_values parameter of the reader function (for example, pd.read_csv).
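Here is a minimal sketch of the na_values parameter with pd.read_csv. The CSV content and column names are invented for illustration, and io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical raw data that uses "?" and "- -" as missing value markers
raw = io.StringIO(
    "name,age,city\n"
    "Ann,34,?\n"
    "Bob,- -,Paris\n"
    "Cid,29,Rome\n"
)

# Cells that exactly match an entry in na_values are read as NaN
df = pd.read_csv(raw, na_values=["?", "- -"])

print(df)
print(df.isna().sum())
```

Because the "- -" marker is converted to NaN at read time, the age column is parsed as a numeric (float) column rather than as strings.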

