How to handle missing values in the data set?

1. Introduction

One of the main difficulties in applying neural networks to real-world problems is that the data set is often incomplete.

We can use several methods to deal with missing values.

The most common ones are either to remove the samples that contain missing values or to replace each missing value with the variable's mean.

However, neither of these methods is always the most appropriate.

Therefore, we must study our data set in detail to obtain the best results.

Usually, missing values are denoted by a label in the data set.

Some standard labels used for representing missing values are NA (not available), NaN (not a number), Unknown, or ?.

Using numeric values, such as −999, is not recommended, as they could be mistaken for real values.

Throughout this work, we will choose the label NA to denote a missing value.
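As a minimal sketch of how such labels can be handled when loading data, the snippet below uses pandas (an assumption; the article does not name a tool) to map all of the labels mentioned above to a single missing-value marker. The CSV content and column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content; column names and values are illustrative only.
raw = io.StringIO(
    "age,balance,education\n"
    "43,1200,Unknown\n"
    "NA,?,2\n"
    "51,NaN,3\n"
)

# List every label that marks a missing value so that all of them become NaN.
# pandas already recognises "NA" and "NaN" by default; "Unknown" and "?"
# have to be listed explicitly.
data = pd.read_csv(raw, na_values=["NA", "NaN", "Unknown", "?"])

# Number of missing values per variable.
print(data.isna().sum())
```

Once every label is mapped to the same marker, the methods described in the following sections can treat all missing values uniformly.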

We will now explain the primary methods for dealing with missing values and when to use each one.

2. Samples unusing

This method sets every sample that has at least one missing value as unused, which excludes it from the analysis.

$$u_i = \text{unused}, \quad \text{if } u_i \text{ contains } NA, \quad i = 1, \ldots, p.$$

We can use it when the number of samples in the data set is large and the number of missing values is negligible relative to the number of samples.
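The rule above can be sketched in a few lines, assuming the data is held in a pandas DataFrame (an assumption for illustration; the values below are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical data: 5 samples, 2 variables, one sample with a missing value.
data = pd.DataFrame({
    "age": [43, 63, np.nan, 48, 55],
    "obstruction": ["no", "no", "yes", "yes", "no"],
})

# A sample is unused if any of its variables is missing.
unused = data.isna().any(axis=1)

# Keep only the complete samples (equivalent to data.dropna()).
used = data[~unused]

print(f"{unused.sum()} of {len(data)} samples set as unused")
```

Comparing the number of unused samples with the total, as in the last line, is a quick way to check that the loss is in fact negligible before dropping anything.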

Example: Samples unusing

A hospital has conducted a study to predict which treatment produces the best survival in patients with colon cancer after surgery.

How well a patient does after surgery depends on how much residual cancer remains.

This example examines data from a randomized controlled trial (RCT) measuring the effect of a particular drug combination on colon cancer.

The study data set has samples from 607 patients and 12 variables.

ID	Sex	Age	Obstruction	Perforation	...	Outcome
1	male	43	no	no	...	die
2	male	63	no	no	...	survive
...	...	...	...	...	...	...
928	female	48	yes	no	...	survive

The total number of missing values is 17.

Missing values	Missing samples
17 (0%)	17 (2%)

Since the number of missing values is negligible relative to the number of samples we have, we can eliminate the samples containing missing values from the analysis.

After removing the samples containing missing values, we are left with a dataset with 590 samples.
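The two figures reported in the "Missing values / Missing samples" table of this example can be computed with a small helper. This is a sketch under the assumption that the data is a pandas DataFrame; `missing_summary` is a hypothetical name, and the toy data below is not the colon cancer data set.

```python
import numpy as np
import pandas as pd

def missing_summary(data: pd.DataFrame) -> dict:
    """Count missing values (cells) and missing samples (rows with any gap)."""
    n_samples, n_variables = data.shape
    missing_values = int(data.isna().sum().sum())
    missing_samples = int(data.isna().any(axis=1).sum())
    return {
        "missing_values": missing_values,
        "missing_values_pct": 100 * missing_values / (n_samples * n_variables),
        "missing_samples": missing_samples,
        "missing_samples_pct": 100 * missing_samples / n_samples,
    }

# Toy data: 4 samples, 3 variables.
toy = pd.DataFrame({
    "age": [43, np.nan, 48, 60],
    "sex": ["male", "male", None, "female"],
    "outcome": ["die", "survive", None, "survive"],
})
summary = missing_summary(toy)
print(summary)
```

Note that the two percentages use different denominators: missing values are counted against all cells (samples times variables), while missing samples are counted against the number of samples.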

3. Data imputation

In some cases, almost all samples contain missing values, and choosing the first method would lose a significant amount of information.

If the data set is small or the number of missing values is considerable, we cannot afford to discard the samples with missing values.

In such cases, assigning probable values to the missing data is advisable.

In this sense, imputation replaces missing data with estimated data.

The most common imputation method replaces missing values with the mean value of the corresponding variable.

In some situations, the median or the mode is used instead of the mean as the imputation value.

$$d_{ij} = v_j^{\text{mean}}, \quad \text{if } d_{ij} = NA, \quad i = 1, \ldots, p, \quad j = 1, \ldots, q.$$

It is advisable to study the data beforehand to decide whether to replace the missing values with the mean or the median.

If the variable is numerical but has outliers or an asymmetric distribution, it is advisable to replace the missing values with the median, which is less sensitive to extreme values.

On the contrary, if the variable is numerical and has no outliers, it is advisable to choose the mean value of the data.
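The choice between mean and median can be sketched as follows. The snippet assumes pandas and uses the sample skewness as a rough indicator of outliers or asymmetry; the helper name `impute_numeric` and the threshold of 1.0 are assumptions made for illustration, not values taken from this article.

```python
import numpy as np
import pandas as pd

def impute_numeric(column: pd.Series, skew_threshold: float = 1.0) -> pd.Series:
    """Fill NaN with the median if the column is strongly skewed, else the mean.

    skew_threshold is an assumed rule of thumb, not a value from the article.
    """
    skewness = column.skew()  # NaN values are skipped by default
    use_median = abs(skewness) > skew_threshold
    fill_value = column.median() if use_median else column.mean()
    return column.fillna(fill_value)

# Toy columns: "age" is symmetric, "balance" has one extreme value.
data = pd.DataFrame({
    "age": [30.0, 35.0, np.nan, 40.0, 45.0, 50.0],
    "balance": [100.0, 200.0, 300.0, np.nan, 400.0, 50000.0],
})

# Apply the rule column by column.
imputed = data.apply(impute_numeric)
print(imputed)
```

In this toy data the missing age is filled with the mean (40), while the missing balance is filled with the median (300) because the single extreme value would pull the mean far above the typical balance.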

Example: Data imputation

A Portuguese bank institution aims to predict which bank clients will subscribe to a long-term deposit and which will not.

The data set used is related to the direct marketing campaigns of a Portuguese bank institution.

We have a dataset with 16 variables and 4120 samples.

A total of 8350 (12 %) missing values and 3591 (87 %) missing samples have been recorded.

Missing values	Missing samples
8350 (12%)	3591 (87%)

We would lose too much information if we eliminated all samples with missing values.

To solve the problem, we must study the data beforehand and replace the missing values with the most reasonable estimate.

We can decide whether to replace missing values with the mean or the median by calculating the data statistics.

Variable	Maximum value	Minimum value	Mean	Median
age	87	19	41.22	39
education	3	1	2.15	2
balance	71182	-3313	1439.81	422
contact_type	1	0	0.09	0
last_contact	871	1	223.67	189
previous_conversion	1	0	0.20	0

Taking into account the statistical results and observing no outliers, we can replace the missing values with the mean of the corresponding variable.

4. Time series data interpolation

This method can be helpful to impute missing values in time series.

If we have a dataset with time series containing missing values, we can estimate those values by interpolating between the known neighboring observations.

However, interpolation only gives reliable results if the missing values are well spread out over the series. For example, if several consecutive values are missing, the interpolation over that period will be less accurate.
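A minimal sketch of interpolation on a time series, assuming pandas and linear interpolation (the article does not specify an interpolation method). The series below is hypothetical and contains one isolated gap and one run of three consecutive gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with an isolated gap and a longer run of gaps.
index = pd.date_range("2023-01-01", periods=8, freq="D")
series = pd.Series(
    [10.0, np.nan, 14.0, 16.0, np.nan, np.nan, np.nan, 24.0], index=index
)

# Linear interpolation between the nearest known neighbours.
filled = series.interpolate(method="linear")

# Fill at most one value per run of gaps; the rest of a long run stays NaN.
filled_capped = series.interpolate(method="linear", limit=1, limit_area="inside")

print(filled)
print(filled_capped)
```

The run of three missing values is filled with a straight line between 16 and 24, which is only a guess about what happened in that period. Capping the number of consecutive values filled, as in `filled_capped`, is one way to avoid relying on interpolation over long gaps.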

5. Conclusions

Missing values are unknown elements of the data matrix.

To build the model, we have two main options: to eliminate the samples containing missing values or to replace the missing values with the mean or median. For time series, missing values can also be estimated by interpolation. There are other, more advanced techniques, such as nearest-neighbor imputation.
