How to handle missing values in the data set?

 1 year ago
source link: https://www.neuraldesigner.com/blog/missing-values

1. Introduction

One of the main difficulties in applying neural networks to real-world problems is that the data set is often incomplete.

We can use several methods to deal with missing values.

The most common ones are to remove each sample that contains a missing value or to replace each missing value with the mean of its variable.

However, neither method is the most appropriate in every case.

Therefore, we must study our data set in detail to obtain the best results.

Usually, missing values are denoted by a label in the data set.

Some standard labels used for representing missing values are NA (not available), NaN (not a number), Unknown, or ?.

Using numeric values, such as −999, is not recommended, as they could be mistaken for real values.

Throughout this work, we will use the label NA to denote a missing value.
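As an illustration of how such labels might be handled when loading data, here is a minimal sketch using only the Python standard library. The column names and the inline CSV text are hypothetical, and mapping every label to `None` is an assumption of this sketch, not something the article prescribes.

```python
import csv
import io

# Labels treated as missing, taken from the list above.
MISSING_LABELS = {"NA", "NaN", "Unknown", "?"}

def parse_rows(text):
    """Read CSV text and map every missing-value label to None.

    Assumes well-formed rows (one field per header column).
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        cleaned = {}
        for name, value in row.items():
            value = value.strip()
            cleaned[name] = None if value in MISSING_LABELS else value
        rows.append(cleaned)
    return rows

# Hypothetical data: two variables, three samples.
sample = "age,balance\n43,1500\nNA,?\n63,Unknown\n"
rows = parse_rows(sample)
```

Converting all labels to a single marker early on means the later steps only have to check for one value.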

We will now explain the primary methods for dealing with missing values and when to use each one.

2. Samples unusing

This method marks every sample that has at least one missing value as unused, which excludes it from the analysis.

$$u_i = \text{unused}, \quad \text{if } u_i \text{ contains NA}, \quad i = 1, \ldots, p.$$

We can use it when the number of samples in the data set is large and the number of missing values is negligible relative to the number of samples.
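The rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming each sample is a list and missing values are stored as `None`; the function name `drop_incomplete` and the data are hypothetical.

```python
def drop_incomplete(samples):
    """Keep only the samples that have no missing (None) entries."""
    return [s for s in samples if all(v is not None for v in s)]

# Hypothetical samples: sex, age, obstruction.
samples = [
    ["male", 43, "no"],
    ["male", None, "no"],   # contains a missing value, so it is unused
    ["female", 48, "yes"],
]
used = drop_incomplete(samples)
```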

Example: Samples unusing

A hospital has conducted a study to predict which treatment produces the best survival in patients with colon cancer after surgery.

How well a patient does after surgery depends on how much residual cancer remains.

This example examines data from a randomized controlled trial (RCT) measuring the effect of a particular drug combination on colon cancer.

The study data set has samples from 607 patients and 12 variables.

ID    Sex      Age   Obstruction   Perforation   ...   Outcome
1     male     43    no            no            ...   die
2     male     63    no            no            ...   survive
...   ...      ...   ...           ...           ...   ...
928   female   48    yes           no            ...   survive

The total number of missing values is 17.

Missing values   Missing samples
17 (0%)          17 (2%)

Since the number of missing values is negligible relative to the number of samples, we can exclude the samples containing missing values from the analysis.

After removing the samples containing missing values, we are left with a dataset with 590 samples.
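The two figures in the table above (missing values and samples with missing values) can be computed with a short helper. The sketch below assumes samples are lists with `None` marking a missing value; the function name and the toy data are hypothetical.

```python
def missing_summary(samples):
    """Count missing values and samples with at least one missing value."""
    total_values = sum(len(s) for s in samples)
    missing_values = sum(v is None for s in samples for v in s)
    missing_samples = sum(any(v is None for v in s) for s in samples)
    return {
        "missing_values": missing_values,
        "missing_values_pct": 100 * missing_values / total_values,
        "missing_samples": missing_samples,
        "missing_samples_pct": 100 * missing_samples / len(samples),
    }

# Toy data: 4 samples with 3 variables each.
toy = [[1, 2, 3], [None, 2, 3], [1, None, None], [1, 2, 3]]
summary = missing_summary(toy)
```

For the colon cancer data, 17 missing values out of 607 × 12 = 7284 entries is about 0.2%, and 17 samples out of 607 is about 2.8%, which the table shows as 0% and 2%.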

3. Data imputation

In some cases, almost all samples contain missing values, and choosing the first method would discard a significant amount of information.

If the data set is small or the number of missing values is considerable, we cannot afford to discard the samples with missing values.

In such cases, assigning probable values to the missing data is advisable.

In this sense, imputation replaces missing data with estimated data.

The most common imputation method replaces missing values with the mean value of the corresponding variable.

In some situations, the median or the mode is used instead of the mean as the imputation value.

$$d_{ij} = v_j^{\text{mean}}, \quad \text{if } d_{ij} = \text{NA}, \quad i = 1, \ldots, p, \quad j = 1, \ldots, q.$$

It is advisable to examine the data first to decide whether to replace the missing values with the mean or the median.

If the variable is numerical but has outliers or an asymmetric distribution, it is advisable to replace the missing values with the median.

On the contrary, if the variable is numerical and has no outliers, it is advisable to use the mean value of the data.
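Mean and median imputation of a single variable can be sketched as follows. This is a minimal illustration using Python's `statistics` module, assuming missing values are stored as `None`; the function name `impute` and the example columns are hypothetical.

```python
import statistics

def impute(column, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in column]

# Hypothetical column without outliers: the mean is a reasonable estimate.
ages = [30, None, 40, 50]
ages_filled = impute(ages, "mean")

# Hypothetical column with one outlier (10000): the median is more robust.
balances = [100, 200, None, 300, 10000]
balances_filled = impute(balances, "median")
```

In the second column the mean of the observed values would be 2650, far from most of the data, whereas the median is 250.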

Example: Data imputation

A Portuguese bank institution aims to predict which bank clients will subscribe to a long-term deposit and which will not.

The data set used is related to the direct marketing campaigns of a Portuguese bank institution.

We have a dataset with 16 variables and 4120 samples.

A total of 8350 values (12%) are missing, and 3591 samples (87%) contain at least one missing value.

Missing values   Missing samples
8350 (12%)       3591 (87%)

We would lose too much information if we eliminated all samples with missing values.

To solve the problem, we must first examine the data and then replace the missing values with the most reasonable estimate.

We can decide whether to replace missing values with the mean or the median by calculating the data statistics.

Variable              Maximum value   Minimum value   Mean      Median
age                   87              19              41.22     39
education             3               1               2.15      2
balance               71182           -3313           1439.81   422
contact_type          1               0               0.09      0
last_contact          871             1               223.67    189
previous_conversion   1               0               0.20      0

Taking into account the statistical results and observing no outliers, we can replace the missing values with the mean of the corresponding variable.

4. Time series data interpolation

This method can be helpful to impute missing values in time series.

If we have a dataset with time series containing missing values, we can fill them in by interpolating between the neighboring observed values.

However, the results are only reliable if the missing values are well distributed over the series. For example, if we have several missing values in a row in the dataset, the interpolation in that period will be less accurate.
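As a rough illustration, linear interpolation over a series can be sketched as follows. The sketch assumes equally spaced time steps and `None` for missing values, and it leaves gaps at the start or end of the series unfilled; the function name and the readings are hypothetical, and linear interpolation is only one possible choice.

```python
def interpolate_linear(series):
    """Fill None gaps that lie between two observed points.

    Assumes equally spaced time steps. Leading and trailing gaps
    are left as None because they have only one neighbor.
    """
    result = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result

# Hypothetical readings with a single gap, a double gap, and a trailing gap.
readings = [10.0, None, 14.0, None, None, 20.0, None]
filled = interpolate_linear(readings)
```

The longer the run of consecutive missing values, the more the filled points depend on just two distant observations, which is why accuracy drops in those periods.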

5. Conclusions

Missing values are unknown elements of the data matrix.

To build the model, we have two main options: to exclude the samples containing missing values or to replace the missing values with the mean or median. There are other more advanced techniques, such as nearest-neighbor imputation.
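The article does not describe the nearest-neighbor technique, so the following is only one simple way it might look, not the author's method. The sketch assumes numeric variables, `None` for missing values, Euclidean distance over the variables observed in the incomplete sample, and at least one complete sample; the function name and data are hypothetical.

```python
import math

def knn_impute(samples, k=2):
    """Fill each missing value with the mean of that variable over the
    k nearest complete samples. Requires at least one complete sample."""
    complete = [s for s in samples if None not in s]
    result = []
    for s in samples:
        if None not in s:
            result.append(list(s))
            continue
        # Compare only on the variables that this sample actually has.
        observed = [j for j, v in enumerate(s) if v is not None]

        def distance(c):
            return math.sqrt(sum((s[j] - c[j]) ** 2 for j in observed))

        neighbors = sorted(complete, key=distance)[:k]
        filled = [
            sum(n[j] for n in neighbors) / len(neighbors) if v is None else v
            for j, v in enumerate(s)
        ]
        result.append(filled)
    return result

# Hypothetical data: the last sample is missing its second variable.
data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
imputed = knn_impute(data, k=2)
```

In practice the variables would usually be scaled first so that no single variable dominates the distance.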

