A Practical Guide on Missing Values with Pandas

4 years ago

source link: https://towardsdatascience.com/a-practical-guide-on-missing-values-with-pandas-8fb3e0b46c24?gi=bcc4ba6a908b

A missing value means we do not have the information about a feature (column) of a particular observation (row). Why not just remove that observation from the dataset and move on? We can, but usually should not, for two reasons:

We typically have many features of an observation so we don’t want to lose the observation just because of one missing feature. Data is valuable.
We typically have more than one observation with missing values. In some cases, we cannot afford to remove many observations from the dataset. Again, data is valuable.

In this post, we will go through how to detect and handle missing values as well as some key points to keep in mind.

The outline of the post:

Missing value markers
Detecting missing values
Calculations with missing values
Handling missing values

As always, we start by importing numpy and pandas.

import numpy as np
import pandas as pd

Missing value markers

The default missing value representation in Pandas is NaN, but Python’s None is also detected as a missing value.

s = pd.Series([1, 3, 4, np.nan, None, 8])
s

Although we created the series from integers, the values are upcast to float because np.nan is a float. Pandas 1.0 introduced a new missing value representation, <NA>, which can be used with integers without causing upcasting. To use it, we need to explicitly request the dtype pd.Int64Dtype().

s = pd.Series([1, 3, 4, np.nan, None, 8], dtype=pd.Int64Dtype())
s

This time the integer values are not upcast to float.
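A quick way to confirm the difference is to compare the dtypes of the two series. The following is a minimal sketch; the names values, s_float and s_int are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

values = [1, 3, 4, np.nan, None, 8]

# Default behaviour: the missing values force the whole series to float64.
s_float = pd.Series(values)

# Nullable integer dtype: integers stay integers, missing values become <NA>.
s_int = pd.Series(values, dtype=pd.Int64Dtype())

print(s_float.dtype)       # float64
print(s_int.dtype)         # Int64
print(s_int.isna().sum())  # 2
```

Both np.nan and None end up as <NA> in the nullable integer series, so isna() counts two missing values.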

Another missing value representation is NaT, which marks missing values in data of the datetime64[ns] type.
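As a small illustration, a None inside a list of date strings becomes NaT once the list is converted to datetimes. The dates below are made up for the example.

```python
import pandas as pd

# The None entry has no date, so pandas stores it as NaT.
dates = pd.Series(pd.to_datetime(["2020-01-01", None, "2020-01-03"]))

print(dates)
print(dates.isna().tolist())  # [False, True, False]
```

NaT is picked up by isna() in the same way as NaN and None.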

Note: Two np.nan values do not compare equal to each other, whereas two None values do.
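The comparison behaviour can be checked directly. This is a short sketch with illustrative variable names; it also shows pd.isna, which is a safer way to test for missing values than an equality check.

```python
import numpy as np
import pandas as pd

nan_equal = np.nan == np.nan
none_equal = None == None

print(nan_equal)   # False
print(none_equal)  # True

# Since NaN never equals itself, use pd.isna instead of == to detect it.
print(pd.isna(np.nan))  # True
print(pd.isna(None))    # True
```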

Note: Not all missing values come in a clean np.nan or None format. For example, the dataset we work on may include “?” or “- -” values in some cells. We can convert them to np.nan when reading the dataset into a pandas dataframe by passing these values to the na_values parameter of the reader function (for example, pd.read_csv).
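A minimal sketch of this, assuming the data arrives as CSV. The column names and rows below are hypothetical and are read from an in-memory string instead of a file.

```python
import io

import pandas as pd

# Hypothetical CSV content; "?" and "- -" stand in for missing entries.
raw = io.StringIO(
    "name,age,city\n"
    "Ann,34,Paris\n"
    "Bob,?,Rome\n"
    "Cleo,29,- -\n"
)

# Treat the custom markers as missing values while parsing.
df = pd.read_csv(raw, na_values=["?", "- -"])

print(df)
print(df.isna().sum())
```

With the markers declared up front, the age column is parsed as numeric instead of as text, and both placeholders show up as NaN.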
