
Top 5 Interview Questions on Missing Value Imputation

 1 year ago
source link: https://www.analyticsvidhya.com/blog/2022/11/top-5-interview-questions-on-missing-value-imputation/

 This article was published as a part of the Data Science Blogathon.

Introduction

Missing values are entries in a dataset that hold NaN or no value for a given record. Handling them is one of the most important and trickiest parts of data cleaning and preprocessing in machine learning, because most machine learning algorithms either fail or perform poorly when the dataset contains missing values.

In this situation, there are two broad ways to solve the problem:

  • Remove the samples that contain missing values
  • Impute the missing values using some strategy

Removing the samples with missing values is not always a good approach, because the removed samples can also contain useful information. In most cases, the better approach is to impute the missing values.

Because real-world datasets very often contain missing values, imputation is a necessary step in many machine learning problems. In this article, we will discuss the top 5 interview questions related to missing data imputation in machine learning, along with the core intuition and working mechanism behind each. Let’s explore the questions one by one.


Missing Value Imputation Interview Questions

1. What is Complete Case Analysis (CCA) in Machine Learning?

In Machine Learning, Complete Case Analysis is a technique in which all the samples containing missing values are dropped or removed from the dataset.

Imputing missing values can consume considerable computational power and time. When time is a constraint, complete case analysis can be used instead, as it is easy to code and fast to run.


However, complete case analysis is not the best solution for handling missing data. By dropping the samples with missing values, we also lose the information held in their observed fields, and the dropped samples may contain important information that the remaining data does not. So in most cases, complete case analysis is not preferred unless there is no other option.

According to researchers, this technique should be considered only when 5% or less of the data is missing from the dataset.
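As a minimal sketch of complete case analysis, the snippet below uses pandas on a small made-up DataFrame (the column names and values are assumptions for illustration only). It first measures what share of rows contains a missing value, then drops those rows with `dropna`.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and values are made up for illustration.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 37],
    "income": [48000, 52000, 61000, np.nan, 45000, 58000],
    "city":   ["Pune", "Delhi", "Mumbai", "Delhi", None, "Pune"],
})

# Fraction of rows that contain at least one missing value.
missing_row_fraction = df.isna().any(axis=1).mean()

# Complete case analysis: keep only the rows with no missing values.
complete_cases = df.dropna(axis=0, how="any")

print(f"rows with missing values: {missing_row_fraction:.0%}")
print(f"rows kept: {len(complete_cases)} of {len(df)}")
```

On this toy frame, half of the rows contain a missing value, which is far above the 5% guideline, so complete case analysis would discard too much data here.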

2. Which imputation is better for numerical data with outliers, mean or median? What is the reason behind the choice?

Most of the time, when numerical data has missing values, mean or median imputation is preferred. Mean imputation replaces the missing values with the mean of the column, and median imputation replaces them with the median of the column.

When outliers are present in the dataset, median imputation is preferred. Mean imputation fills the missing values with the column mean, and that mean is computed over all observed values, including the outliers, so it is pulled toward them and becomes biased. The median depends only on the middle of the sorted values, so outliers have little effect on it. Hence, median imputation is preferred for numerical data with outliers.
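The effect is easy to see on a small example. The sketch below uses a made-up pandas Series (the values are assumptions chosen for illustration) with one missing entry and one outlier of 500, and compares the fill value that each strategy would use.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one outlier (500.0).
salary = pd.Series([30.0, 32.0, 35.0, np.nan, 38.0, 40.0, 500.0])

# Both statistics skip NaN by default in pandas.
mean_value = salary.mean()      # pulled upward by the outlier
median_value = salary.median()  # stays near the typical values

# Fill the missing entry with each statistic.
mean_imputed = salary.fillna(mean_value)
median_imputed = salary.fillna(median_value)

print("mean fill value:  ", mean_value)
print("median fill value:", median_value)
```

Here the mean of the observed values is 112.5, which is larger than every non-outlier value in the column, while the median is 36.5 and sits among the typical values.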

3. What is the difference between Univariate and Multivariate Imputation of the missing data? Give Examples.


4. What are KNN Imputer and Iterative Imputer? How are they different from each other?


5. What are the assumptions of KNN and Iterative Imputer? In which type of cases are they preferred?


Conclusion

In this article, we discussed the top 5 interview questions that can be asked about missing data imputation in machine learning, along with the best possible answer to each and the core intuition behind it. Practicing these questions will help you better understand how each technique works and answer related questions effectively.

Some Key takeaways from this article are:

1. Complete Case Analysis is a technique that drops the samples containing missing data. It is usually not preferred, because deleting those samples also discards useful information.

2. Mean imputation is not preferred when outliers are present in the dataset; use median imputation instead.

3. KNN and Iterative Imputers are multivariate imputation techniques that involve more computation. Iterative Imputer performs exceptionally well on data that is missing at random.
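For reference, both multivariate imputers are available in scikit-learn as `KNNImputer` and `IterativeImputer`. The sketch below is only a usage illustration on a made-up array; the parameter values (`n_neighbors=2`, `max_iter=10`, `random_state=0`) are assumptions rather than recommendations. Note that `IterativeImputer` is still marked experimental in scikit-learn and needs the explicit enabling import.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Toy feature matrix with two missing entries (values are made up).
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, np.nan, 5.0],
    [3.0, 6.0, 7.0],
    [4.0, 8.0, np.nan],
    [5.0, 10.0, 11.0],
])

# KNN imputation: fill each gap from the nearest rows, with distance
# measured on the features that both rows have observed.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative imputation: model each feature with missing values as a
# function of the other features and refine the estimates in rounds.
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)

print(knn_filled)
print(iter_filled)
```

Both imputers use the other columns to estimate a missing entry, which is what makes them multivariate, in contrast to mean or median imputation, which looks at one column at a time.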

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
