
Using SQL to Improve Missing Values

Missing data points can sometimes feel like missing jigsaw pieces (photo: author)

For a few months now, we've been looking at different ways to use SQL in data science, most recently at using SQL to detect outliers (not the same as detecting anomalies, although that topic is coming in the future).

In the first post in this sporadic series, we looked at how to assess missing data by counting the proportion of nulls in a column. However, simply finding and counting missing data points is of limited use on its own; you also need tools for dealing with them once you have found them.
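
As a quick recap, a minimal sketch of that kind of check might look like the following, assuming a table named observations with a column var (both names are placeholders, not anything from the earlier post):

-- Proportion of nulls in a column (placeholder names: observations, var)
SELECT
  AVG(CASE WHEN var IS NULL THEN 1.0 ELSE 0.0 END) AS null_proportion
FROM observations;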

The core ways to handle missing values should be familiar to all data scientists, a phrase which here means ‘if you aren’t familiar, you should memorise the following list’:

  1. Variable deletion: if a variable has so many missing cases that it appears useless, delete the whole column.
  2. Listwise (casewise) deletion: if too many variables are missing for a particular observation, delete that row.
  3. Dummy variable adjustment: if the variable is missing for a particular case, substitute an assumed constant value and add an indicator variable flagging the substitution. Depending on the problem, the median may be the intuitive choice, or a value that represents a 'neutral' setting.
  4. Imputation: use an algorithm to fill in the value, ranging from a simple random number at the most basic end of the spectrum to a value predicted by its own model at the more complex end (a minimal mean-imputation sketch follows this list).
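
SQL can handle the simpler end of that imputation spectrum directly. As a minimal sketch, and again assuming a table named observations with a numeric column var (placeholder names), a window function can replace each null with the mean of the non-null values:

-- Mean imputation: AVG ignores nulls, so the window average is computed
-- over the observed values only
SELECT
  COALESCE(var, AVG(var) OVER ()) AS var_imputed
FROM observations;

Median imputation works the same way where the implementation offers a suitable percentile function, although support for that varies between databases.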

SQL is clearly better at some of these than others. Dropping a variable is as simple as leaving the column name out of the SELECT statement; listwise deletion may be as simple as a WHERE clause testing for nulls.
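
As a minimal sketch of both, using the placeholder names observations, var and other_var:

-- Variable deletion: leave var out of the SELECT list entirely
SELECT other_var
FROM observations;

-- Listwise deletion: keep only the rows where var is present
SELECT other_var, var
FROM observations
WHERE var IS NOT NULL;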

Using these two methods comes with risks — you can easily introduce bias if the data are not Missing Completely at Random. It’s not hard to think of situations where measurements aren’t collected due to specific circumstances reflected in the data. An example given by Frank Harrell in his book ‘Regression Modeling Strategies’ is blood pressure measurements which are not taken on patients expected to die shortly — hence excluding patient observations missing blood pressure measurements in a model of patient outcomes could selectively exclude patients who died, creating an obvious skew.

While the above example illustrates the risk of listwise deletion, the disadvantage of dropping the variable entirely can also be seen from the same example: if you ignore all blood pressure information, you ignore a lot of information useful for explaining medical outcomes.

A more sophisticated approach is what Paul Allison, author of 'Missing Data' in the Sage series of short monographs on statistics, calls 'Dummy Variable Adjustment'. It involves creating an indicator variable to flag the existence of missing information, and updating the original variable to carry a constant value in place of the nulls. In SQL this can be a simple CASE expression combined with coalesce. We have assumed 2 for the numeric constant:

SELECT
  CASE WHEN var IS NULL THEN 1 ELSE 0 END AS Missing_var,
  COALESCE(var, 2) AS var_adjusted
FROM Table

The above code is close to implementation agnostic, since coalesce is part of the SQL standard. Individual implementations also offer shortcuts: in MySQL, for example, ifnull(var, 2) performs the same substitution as coalesce, while MySQL's isnull is a logical test for null returning 1 or 0, which could replace the CASE expression above. Other implementations are likely to have similar shortcuts.
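
As an illustration, a MySQL-specific sketch of the same query, using the placeholder table name observations, might look like this:

-- MySQL shortcuts: ISNULL() as the missingness flag, IFNULL() for the substitution
SELECT
  ISNULL(var) AS Missing_var,        -- 1 if var is null, 0 otherwise
  IFNULL(var, 2) AS var_adjusted     -- same effect as COALESCE(var, 2)
FROM observations;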

These techniques for dealing with missing values are amongst the most basic. While it’s fair to say that some of the most advanced techniques are difficult to implement in SQL, these can still take you a reasonable distance in your journey.

One reason to perform these steps at your database is that you can decouple your data preparation and exploration steps from your modeling step by delaying the choice of modeling environment (e.g. R, Python, RapidMiner) until you have a somewhat ready data set. You can even perform the analysis in more than one environment to check results.

Therefore, by analysing data 'in place' using SQL and doing some of the basic data preparation there, you can maintain a higher degree of flexibility than is possible by moving to a modeling environment at the first opportunity.

Robert de Graaf’s book, Managing Your Data Science Projects, is out now through Apress.

Follow Robert on Twitter

