Detecting outliers in SAS: Part 2: Estimating scale

In a previous blog post on robust estimation of location, I worked through some of the examples in the survey article, "Robust statistics for outlier detection," by Peter Rousseeuw and Mia Hubert. I showed that SAS/IML software and PROC UNIVARIATE both support the robust estimators of location that are mentioned in the paper. Today's post looks at the robust estimators of scale that are mentioned in the same paper and works through more examples in the paper. The paper uses the following five measurements, which contain one outlier:
6.25, 6.27, 6.28, 6.34, 63.1

Robust scale statistics in SAS/IML software

SAS/IML software contains several functions for robust estimation of scale. For estimating scale, the MAD function is often used. The MAD statistic is an acronym for "median of all absolute deviations from the median." The MAD statistic is often multiplied by a constant in order to make it unbiased for data that are normally distributed. The constant is 1.483, but you don't need to remember that value because the MAD function has the "NMAD" option that automatically includes the multiplication factor, as shown in the following example:

proc iml;
x = {6.25, 6.27, 6.28, 6.34, 63.1};
mad = mad(x, "NMAD"); 
print mad;

Rousseeuw and Hubert briefly mention two other robust measures of scale: the Qn estimator (Rousseeuw and Croux, JASA, 1993) and the interquartile range (IQR), which is well-known from the Tukey box plot. You can compute both of these estimators in SAS/IML software, as follow:

Qn = mad(x, "QN"); 
call qntl(q, x, {0.25 0.75}); /* compute 25th and 75th percentile */
IQR = q[2] - q[1];
print Qn IQR;

The three robust estimates of scale are similar. They range from 0.04 (MAD) to 0.07 (IQR). The IQR is sometimes divided by 1.349 in order to estimate the scale of normally distributed data. If you divide 0.07 by 1.349, you get 0.052, which make the estimates even more similar.

The connection with outlier detection

All this discussion of robust estimation of location and scale is closely related to detecting outliers. In practice, outliers are often detected using a rule or formula. The classical rule is to compute z-scores, which are just the normalized values zi = (xi - x̄)/s, where x̄ is the sample mean and s is the sample standard deviation. An outlier is defined as any observation for which |zi| exceeds some cutoff value, typically 2.5 or 3.

This rule fails when there is a large outlier in the data. For example, the following SAS/IML statements compute the classical z-scores for the Rousseeuw and Hubert example:

/* rules to detect outliers */
z = (x - mean(x)) / std(x);
print z;

Because the mean and standard deviation are both influenced by the outlier, no observation has a large z-score, and therefore none is flagged as an outlier. However, using robust estimators in the z-score formula does successfully identify the outlier, as shown in the following statements:

zRobust = (x - median(x)) / mad(x, "NMAD");
print zRobust;

The outlier has a HUGE "robust score." Of course, you don't have to print out the scores and inspect them. The following SAS/IML statements use the LOC function (the most useful function that you've never heard of!) to find all of the data for which the robust z-score exceeds 2.5, and prints only the outliers:

outIdx = loc(abs(zRobust)>2.5);
if ncol(outIdx)>0 then 
   outliers = x[outIdx];
else 
   outliers = .;
print outliers;

Robust Estimates in the UNIVARIATE Procedure

The UNIVARIATE procedure also supports robust estimates of scale. The ROBUSTSCALE option on the PROC UNIVARIATE statement computes the robust estimates in the Rousseeuw and Hubert article, as well as others. The documentation for the UNIVARIATE procedure includes a section that describes the robust estimates of scale. The following example computes robust estimates of scale:

data a;
input x @@;
datalines;
6.25 6.27 6.28 6.34 63.1
;
 
proc univariate data=a robustscale;
   var x;
   ods select RobustScale;
run;

Notice that the output from PROC UNIVARIATE includes two columns. The first column is an unadjusted robust estimate. The second column estimates the standard deviation for normally distributed data, which can be derived from the first column.

Detecting outliers in SAS: Part 2: Estimating scale

Detecting outliers in SAS: Part 2: Estimating scale

Robust scale statistics in SAS/IML software

The connection with outlier detection

Robust Estimates in the UNIVARIATE Procedure

Recommend

H4S4: few improvments ideas

标签：夜话

Enterprise Analytics Uncovers the Predictive Nature of the Customer Experience

Add a loader to your website

标签：GTD

Detect Internet Connection Status In Browser

SAS Global Forum: Your experience, your way

Diving into Husky and Lint-staged

标签：CodeMirror

标签：AutoHotKey

About Joyk