
Using SQL to Improve Missing Values

Missing data points can sometimes feel like missing jigsaw pieces (photo: author)

For a few months now, we've been looking at different ways to use SQL in data science, most recently at using SQL to detect outliers (not the same as detecting anomalies, although that topic is coming in the future).

In the first post in this sporadic series, we looked at how to assess missing data by counting the proportion of nulls in a column. However, simply finding and counting missing data points is of limited use on its own; you also need tools for dealing with them once you have found them.
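
As a quick recap, a minimal sketch of that kind of check might look like the following, assuming a table named observations with a column var (both names are placeholders, not anything from the earlier post):

-- Proportion of nulls in a column (placeholder names: observations, var)
SELECT
  AVG(CASE WHEN var IS NULL THEN 1.0 ELSE 0.0 END) AS null_proportion
FROM observations;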

The core ways to handle missing values should be familiar to all data scientists, a phrase which here means ‘if you aren’t familiar, you should memorise the following list’:

  1. Variable deletion: if a variable has so many missing cases that it appears useless, delete the whole column.
  2. Listwise (casewise) deletion: if too many variables are missing for a particular observation, delete that row.
  3. Dummy variable adjustment: if the variable is missing for a particular case, substitute an assumed constant value and add an indicator variable flagging the substitution. Depending on the problem, the median may be the intuitive choice, or a value that represents a 'neutral' setting.
  4. Imputation: use an algorithm to fill in the value, ranging from a simple random number at the most basic end of the spectrum to a value predicted by its own model at the more complex end (a minimal mean-imputation sketch follows this list).
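
SQL can handle the simpler end of that imputation spectrum directly. As a minimal sketch, and again assuming a table named observations with a numeric column var (placeholder names), a window function can replace each null with the mean of the non-null values:

-- Mean imputation: AVG ignores nulls, so the window average is computed
-- over the observed values only
SELECT
  COALESCE(var, AVG(var) OVER ()) AS var_imputed
FROM observations;

Median imputation works the same way where the implementation offers a suitable percentile function, although support for that varies between databases.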

SQL is clearly better at some of these than others. Dropping a variable is as simple as leaving the column name out of the SELECT statement; listwise deletion may be as simple as a WHERE clause testing for nulls.
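
As a minimal sketch of both, using the placeholder names observations, var and other_var:

-- Variable deletion: leave var out of the SELECT list entirely
SELECT other_var
FROM observations;

-- Listwise deletion: keep only the rows where var is present
SELECT other_var, var
FROM observations
WHERE var IS NOT NULL;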

Using these two methods comes with risks — you can easily introduce bias if the data are not Missing Completely at Random. It’s not hard to think of situations where measurements aren’t collected due to specific circumstances reflected in the data. An example given by Frank Harrell in his book ‘Regression Modeling Strategies’ is blood pressure measurements which are not taken on patients expected to die shortly — hence excluding patient observations missing blood pressure measurements in a model of patient outcomes could selectively exclude patients who died, creating an obvious skew.

While the above example illustrates the risk of listwise deletion, the disadvantage of dropping the variable entirely can also be seen from the same example: if you ignore all blood pressure information, you ignore a lot of information useful for explaining medical outcomes.

A more sophisticated approach is what Paul Allison, author of 'Missing Data' in the Sage series of short monographs on statistics, calls 'Dummy Variable Adjustment'. It involves creating an indicator variable to flag the existence of missing information, and updating the original variable to carry a constant value in place of the nulls. In SQL this can be a simple CASE expression combined with coalesce. We have assumed 2 for the numeric constant:

SELECT
  CASE WHEN var IS NULL THEN 1 ELSE 0 END AS Missing_var,
  COALESCE(var, 2) AS var_adjusted
FROM Table

The above code is close to implementation agnostic, since coalesce is part of the SQL standard. Individual implementations also offer shortcuts: in MySQL, for example, ifnull(var, 2) performs the same substitution as coalesce, while MySQL's isnull is a logical test for null returning 1 or 0, which could replace the CASE expression above. Other implementations are likely to have similar shortcuts.
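
As an illustration, a MySQL-specific sketch of the same query, using the placeholder table name observations, might look like this:

-- MySQL shortcuts: ISNULL() as the missingness flag, IFNULL() for the substitution
SELECT
  ISNULL(var) AS Missing_var,        -- 1 if var is null, 0 otherwise
  IFNULL(var, 2) AS var_adjusted     -- same effect as COALESCE(var, 2)
FROM observations;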

These techniques for dealing with missing values are amongst the most basic. While it’s fair to say that some of the most advanced techniques are difficult to implement in SQL, these can still take you a reasonable distance in your journey.

One reason to perform these steps at your database is that you can decouple your data preparation and exploration steps from your modeling step by delaying the choice of modeling environment (e.g. R, Python, RapidMiner) until you have a somewhat ready data set. You can even perform the analysis in more than one environment to check results.

Therefore, by analysing data 'in place' using SQL and doing some of the basic data preparation there, you can maintain a higher degree of flexibility than is possible by moving to a modeling environment at the first opportunity.

Robert de Graaf’s book, Managing Your Data Science Projects, is out now through Apress.

Follow Robert on Twitter

