Top 5 Interview Questions on Missing Value Imputation

This article was published as a part of the Data Science Blogathon.

Introduction

A missing value is an entry in a dataset, often shown as NaN, for which no value was recorded for a particular record. Handling missing values is one of the most important and trickiest parts of data cleaning and preprocessing in machine learning. Most machine learning algorithms either cannot handle missing values in the dataset or perform poorly when they are present.

In this situation, there are two broad ways to solve the problem:

Remove the samples that contain missing values
Impute the missing values using some strategy

Removing samples with missing values every time is not a good approach, because the dropped samples can contain useful information. In most cases, the better approach is to impute the missing values.

Missing values are very likely to be present in real-world datasets, so missing value imputation is a necessary step in many machine learning problems. In this article, we will discuss the top 5 interview questions related to missing data imputation in machine learning, along with the core intuition and working mechanism behind each technique. Let’s explore the interview questions one by one.

Source: https://resources.biginterview.com/wp-content/uploads/2022/07/Panel-Interview-101-1080×675.jpg

Missing Value Imputation Interview Questions

1. What is Complete Case Analysis (CCA) in Machine Learning?

In machine learning, Complete Case Analysis is a technique in which all samples that contain missing values are dropped from the dataset, so only the complete cases remain.

Imputing missing values can cost significant computation and time, so when time is limited this method can be used instead, as it is easy to code and fast to run.

However, complete case analysis is not the best solution for handling missing data. By dropping samples with missing values, we also lose the observed information in those samples, and the dropped samples may contain important information that the remaining data does not. So in most cases, complete case analysis is not preferred unless there is no other option.

According to a guideline commonly cited by researchers, this technique should be considered only when 5% or less of the data is missing from the dataset.
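The idea can be sketched in a few lines of pandas. This is a minimal illustration, not a prescribed workflow: the toy DataFrame and its column names are hypothetical, and checking the share of incomplete rows before dropping them is just one way to apply the 5% guideline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47, 52],
    "income": [40000, 52000, np.nan, 61000, 58000],
    "city": ["Pune", "Delhi", "Mumbai", None, "Delhi"],
})

# Fraction of rows that have at least one missing value.
missing_row_fraction = df.isna().any(axis=1).mean()

# Complete case analysis: keep only the rows with no missing values.
complete_cases = df.dropna()

print(f"Rows with missing values: {missing_row_fraction:.0%}")
print(f"Rows kept: {len(complete_cases)} of {len(df)}")
```

In this toy example most rows are incomplete, far above the 5% guideline, so dropping them would discard most of the dataset and imputation would be the more sensible choice.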

2. Which imputation is better for numerical data with outliers, mean or median? What is the reason behind the choice?

When numerical data has missing values, mean and median imputation are the most commonly preferred techniques. Mean imputation replaces the missing values with the mean of the column, and median imputation replaces them with the median of the column.

When outliers are present in the dataset, median imputation is preferred. Mean imputation computes the mean of the column from every observed value, including the outliers, so the mean is pulled toward the outliers and the imputed value is biased. The median, in contrast, is the middle value of the sorted column, so outliers have little effect on it. Hence, median imputation is preferred for numerical data with outliers.
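A small sketch makes the difference concrete. The salary values below are made up for illustration, with one deliberately extreme value acting as the outlier; the variable names are assumptions, not part of any library.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column (in thousands) with one missing value
# and one extreme outlier (500).
salary = pd.Series([30.0, 32.0, 35.0, 31.0, np.nan, 33.0, 500.0])

# pandas skips NaN by default when computing these statistics.
mean_value = salary.mean()
median_value = salary.median()

# Fill the missing entry with each statistic.
mean_imputed = salary.fillna(mean_value)
median_imputed = salary.fillna(median_value)

print(f"Mean used for imputation:   {mean_value:.2f}")
print(f"Median used for imputation: {median_value:.2f}")
```

Here the mean is dragged far above the typical salaries by the single outlier, while the median stays close to them, so the median-imputed value is the more plausible one. If scikit-learn is available, `SimpleImputer(strategy="median")` from `sklearn.impute` should give the same kind of result inside a preprocessing pipeline.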

3. What is the difference between Univariate and Multivariate Imputation of the missing data? Give Examples.

4. What are KNN Imputer and Iterative Imputer? How are they different from each other?

5. What are the assumptions of KNN and Iterative Imputer? In which type of cases are they preferred?

Conclusion

In this article, we discussed the top 5 interview questions that can be asked about missing data imputation in machine learning, along with the answers and the core intuition behind them. Practicing these interview questions will help you understand how each technique works and answer questions on these topics effectively.

Some Key takeaways from this article are:

1. Complete Case Analysis is a technique that drops the samples containing missing values. It is usually not preferred, because the dropped samples may carry useful information.

2. Mean imputation is not preferred when outliers are present in the dataset; use median imputation instead.

3. KNN and Iterative Imputers are multivariate imputation techniques that require more computation. The Iterative Imputer performs well on data that is missing at random.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
