Unsupervised Concept Drift Detection Techniques for Machine Learning Models with Examples in Python

Concept drift is a serious operational issue for deployed machine learning models. Please refer to my earlier post for an introduction and the various concepts. Unsupervised drift detection techniques, although always applicable to unsupervised models, are also frequently effective for supervised machine learning models. Supervised machine learning is essentially about finding the conditional distribution P(y|x). For supervised machine learning models, a change in P(x) is often accompanied by a change in P(y|x). Essentially, P(x) is used as a proxy for detecting a change in P(y|x). However, in cases where P(x) is independent of P(y|x), these techniques will fail.

We will go through a set of unsupervised drift detection algorithms in this post. Finally, we will detect drift in a retail customer churn prediction model using the Nearest Neighbor Count algorithm. The Python implementation is available in my open source project beymani on GitHub.

Unsupervised Drift Detection

The fact that unsupervised drift detection techniques can be used for unsupervised machine learning models is self evident. As an example, for a customer cohort problem, you can use these techniques to decide whether to retrain your clusters.

However, why would we even want to apply unsupervised drift detection techniques to supervised ML models? For supervised drift detection, you need the actual label soon after the deployed model has made the prediction. In many scenarios that’s not feasible.

Consider a customer churn prediction model that predicts whether a customer will churn within a month. Because of the time horizon associated with the prediction, we have to wait up to a month after the prediction to learn the true outcome. In such cases, the only option is unsupervised drift detection.

Application of these algorithms to supervised ML models is not infallible. They will fail when P(x) and P(y|x) are independent. P(x) may change while P(y|x) stays unchanged. On the other hand, P(y|x) may change without any change in P(x). In both cases P(y) will typically change.

Next we will go through the algorithms. Many of the algorithms below work for univariate data only and so are of limited use for machine learning problems, which usually involve multiple features. Nonetheless we will review them.

The data set used is the incoming data for prediction by a deployed model. All the algorithms use 2 data sets. There are two ways to construct the data sets:

Have a reference data set collected right after model deployment. The current data set is the most recent data over some time window.
Have a sliding window, with one data set corresponding to the first half of the window and the other corresponding to the second half.

With the second approach, drift detection will be gradual. With the first approach, the reference data set does not change and the detected drift is sharper.
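The two ways of constructing the data sets can be sketched as follows. This is only an illustration; the function names and the window sizes are made up for this example and are not taken from beymani.

```python
from collections import deque


def fixed_reference_windows(stream, ref_size, cur_size):
    """Yield (reference, current) pairs where the reference set never changes."""
    reference = []
    current = deque(maxlen=cur_size)
    for value in stream:
        if len(reference) < ref_size:
            reference.append(value)  # collected right after deployment
            continue
        current.append(value)        # most recent data only
        if len(current) == cur_size:
            yield list(reference), list(current)


def sliding_half_windows(stream, window_size):
    """Yield (first half, second half) of a single sliding window."""
    window = deque(maxlen=window_size)
    half = window_size // 2
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            items = list(window)
            yield items[:half], items[half:]


stream = list(range(10))
fixed_pairs = list(fixed_reference_windows(stream, ref_size=4, cur_size=3))
sliding_pairs = list(sliding_half_windows(stream, window_size=4))
print(fixed_pairs[-1])    # reference is still the first 4 values
print(sliding_pairs[-1])  # both halves have moved with the stream
```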

The citations for the papers on all these algorithms can be found in this excellent survey paper on concept drift detection.

Adaptive Windowing

A window is scanned over different split points to find a split point within the window where the difference in the mean values of the two sub windows exceeds a threshold. The threshold is based on the Hoeffding bound. Applicable for univariate data.
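Here is a heavily simplified sketch of the split scan. It is not the full adaptive windowing algorithm, which maintains the window incrementally and shrinks it after a detection. The function name, the delta and min_size parameters and the assumption that values lie in [0, 1] are all choices made for this example.

```python
import math
import numpy as np


def adwin_style_split(window, delta=0.002, min_size=5):
    """Return the first split index whose sub window means differ by more
    than a Hoeffding style bound, or None. Values are assumed to lie in [0, 1]."""
    window = np.asarray(window, dtype=float)
    n = len(window)
    for split in range(min_size, n - min_size + 1):
        left, right = window[:split], window[split:]
        # effective sample size of the two sub windows (half their harmonic mean)
        m = 1.0 / (1.0 / len(left) + 1.0 / len(right))
        # Hoeffding style threshold; confidence delta is shared across the window
        eps = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
        if abs(left.mean() - right.mean()) > eps:
            return split
    return None


rng = np.random.default_rng(0)
stable = rng.uniform(0.4, 0.6, 200)
drifted = np.concatenate([rng.uniform(0.0, 0.2, 100), rng.uniform(0.8, 1.0, 100)])
stable_split = adwin_style_split(stable)
drift_split = adwin_style_split(drifted)
print(stable_split, drift_split)
```

Because the scan stops at the first split that crosses the threshold, the returned index can fall before the true change point.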

Kullback Leibler (KL) Divergence

Based on the KL divergence between the distributions of the reference data and the recent data. Since KL divergence is asymmetric, you have to take the average or the maximum of the 2 versions. A large divergence indicates drift.
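A minimal sketch of this calculation for univariate data, using histograms to estimate the two distributions. The bin count and the small smoothing constant eps that avoids division by zero are assumptions of this example.

```python
import numpy as np


def symmetric_kl(reference, recent, bins=20, eps=1e-9):
    """Return (average, maximum) of the two KL divergence directions."""
    reference = np.asarray(reference, dtype=float)
    recent = np.asarray(recent, dtype=float)
    lo = min(reference.min(), recent.min())
    hi = max(reference.max(), recent.max())
    # histograms over a shared range so the bins line up
    p, _ = np.histogram(reference, bins=bins, range=(lo, hi))
    q, _ = np.histogram(recent, bins=bins, range=(lo, hi))
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    kl_pq = float(np.sum(p * np.log(p / q)))
    kl_qp = float(np.sum(q * np.log(q / p)))
    return 0.5 * (kl_pq + kl_qp), max(kl_pq, kl_qp)


rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, 5000)
same = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(2.0, 1.0, 5000)
avg_same, max_same = symmetric_kl(reference, same)
avg_shift, max_shift = symmetric_kl(reference, shifted)
print(avg_same, avg_shift)
```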

Kolmogorov Smirnov Test (KS)

Based on the maximum difference between the cumulative distributions of the reference data and the recent data. Applicable for univariate data.
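For a quick experiment, the two sample KS test is available in SciPy as scipy.stats.ks_2samp. The wrapper function and the significance level alpha below are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift(reference, recent, alpha=0.01):
    """Return (statistic, p value, drift flag) for two univariate samples."""
    result = ks_2samp(reference, recent)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)


rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, 1000)
recent = rng.normal(1.0, 1.0, 1000)   # mean shifted by one standard deviation
ks_stat, ks_pvalue, ks_flag = ks_drift(reference, recent)
print(ks_stat, ks_pvalue, ks_flag)
```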

Cramer Von Mises Test (CVM)

Based on the aggregate weighted difference between the cumulative distributions of the reference data and the recent data. More weight is given to the central region of the distributions. Applicable for univariate data.

Anderson Darling Test (AD)

Based on the aggregate weighted difference between the cumulative distributions of the reference data and the recent data. More weight is given to the tail regions of the distributions. Applicable for univariate data.
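Both the Cramer Von Mises and the Anderson Darling two sample tests are available in SciPy, as scipy.stats.cramervonmises_2samp and scipy.stats.anderson_ksamp. The sketch below simply collects their statistics; the wrapper function and the dictionary keys are made up for this example.

```python
import warnings

import numpy as np
from scipy.stats import anderson_ksamp, cramervonmises_2samp


def cdf_weighted_statistics(reference, recent):
    """Return the CVM and AD two sample statistics for univariate samples."""
    cvm = cramervonmises_2samp(reference, recent)
    with warnings.catch_warnings():
        # SciPy may warn here, e.g. when the AD p value is capped; silence that
        warnings.simplefilter("ignore")
        ad = anderson_ksamp([reference, recent])
    return {"cvm": float(cvm.statistic),
            "cvm_pvalue": float(cvm.pvalue),
            "ad": float(ad.statistic)}


rng = np.random.default_rng(3)
reference = rng.normal(0.0, 1.0, 500)
same = rng.normal(0.0, 1.0, 500)
shifted = rng.normal(1.0, 1.0, 500)
stats_same = cdf_weighted_statistics(reference, same)
stats_shift = cdf_weighted_statistics(reference, shifted)
print(stats_same)
print(stats_shift)
```

Larger statistics mean a larger difference between the two cumulative distributions.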

Total Distribution Discrepancy Test (TD)

Based on the total absolute difference between the distributions of the 2 data sets. Practical for univariate data.
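One way to sketch this is to sum the absolute differences between two histograms built on a shared range. The use of histograms and the bin count are assumptions of this example; the result lies between 0 (identical histograms) and 2 (no overlap).

```python
import numpy as np


def total_discrepancy(first, second, bins=20):
    """Sum of absolute differences between two normalized histograms."""
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    lo = min(first.min(), second.min())
    hi = max(first.max(), second.max())
    p, _ = np.histogram(first, bins=bins, range=(lo, hi))
    q, _ = np.histogram(second, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return float(np.abs(p - q).sum())


rng = np.random.default_rng(4)
low = rng.uniform(0.0, 1.0, 1000)
high = rng.uniform(5.0, 6.0, 1000)   # no overlap with the first sample
td_identical = total_discrepancy(low, low)
td_disjoint = total_discrepancy(low, high)
print(td_identical, td_disjoint)
```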

Relativized Distribution Discrepancy Test (RD)

Based on the relativized absolute difference between the distributions of the 2 data sets. The difference is normalized with probability values from the 2 distributions. Practical for univariate data.

Kernel Based Distribution Discrepancy Test (SCD)

The first data set is divided into 2 partitions and a kernel based distribution is calculated from the first partition. Using the kernel based distribution, the difference in log probability is found between the second partition of the first data set and the second data set. Practical for univariate data.

Page Hinkley Test (PH)

Calculates the cumulative difference between each data value and the average value up to that point. Tracks the minimum of these cumulative differences. Drift is detected whenever the difference between the current cumulative difference and the minimum exceeds a threshold. Applicable for univariate data.
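The description above translates into a few lines of Python. The tolerance term delta, which is commonly subtracted at each step, and the threshold value are assumptions of this sketch and need tuning for real data.

```python
def page_hinkley(values, delta=0.005, threshold=5.0):
    """Return the 0-based index where drift is first flagged, or None."""
    mean = 0.0
    cumulative = 0.0
    minimum = 0.0
    for i, x in enumerate(values, start=1):
        mean += (x - mean) / i            # running average up to this point
        cumulative += x - mean - delta    # cumulative deviation from the average
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return i - 1
    return None


flat_stream = [0.0] * 200
step_stream = [0.0] * 100 + [1.0] * 100   # upward jump at index 100
flat_result = page_hinkley(flat_stream)
step_result = page_hinkley(step_stream)
print(flat_result, step_result)
```

This version only detects upward shifts in the mean. Running it on the negated values is one way to catch downward shifts.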

Adaptive Cumulative Windows Model (ACWM)

Uses KL divergence between distribution of reference data and recent data. Since KL divergence is asymmetric there are 2 versions. If there is no drift in data, the difference between the 2 versions is small. If the difference is above a threshold drift is present. Practical for univariate data.

Abrupt Concept Drift Detection (DetectA)

Uses the mean vector and the covariance matrix for multivariate data. For concept drift based on mean vectors, Hotelling’s T square statistic is used for the difference in the mean vectors. For concept drift based on covariance matrices, Box’s M statistic is used, involving the 2 covariance matrices and a pooled covariance matrix.
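The mean vector part can be sketched with the standard two sample Hotelling’s T square statistic. This shows only the statistic, not the complete DetectA procedure, and the function name is made up for this example.

```python
import numpy as np


def hotelling_t2(sample_a, sample_b):
    """Two sample Hotelling T square statistic using a pooled covariance matrix."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    n_a, n_b = len(a), len(b)
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    pooled = ((n_a - 1) * np.cov(a, rowvar=False)
              + (n_b - 1) * np.cov(b, rowvar=False)) / (n_a + n_b - 2)
    # solve() avoids forming the inverse of the pooled covariance explicitly
    t2 = (n_a * n_b) / (n_a + n_b) * mean_diff @ np.linalg.solve(pooled, mean_diff)
    return float(t2)


rng = np.random.default_rng(5)
reference = rng.normal(size=(300, 3))
same = rng.normal(size=(300, 3))
shifted = rng.normal(size=(300, 3)) + np.array([1.0, 0.0, 0.0])
t2_same = hotelling_t2(reference, same)
t2_shift = hotelling_t2(reference, shifted)
print(t2_same, t2_shift)
```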

PCA Based (SyncStream)

PCA is performed on the two data sets. Take the difference (e.g. cosine distance) between the most significant eigenvectors of the two data sets.
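A sketch of this comparison, reading the description as comparing the leading eigenvector of each data set. That reading, the function names and the use of SVD to obtain the principal direction are assumptions of this example.

```python
import numpy as np


def principal_direction(data):
    """Leading principal component of a data set, obtained with SVD."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def principal_direction_distance(first, second):
    """Cosine distance between leading directions, ignoring their sign."""
    u = principal_direction(first)
    v = principal_direction(second)
    cosine = abs(float(u @ v))   # the sign of an eigenvector is arbitrary
    return 1.0 - min(cosine, 1.0)


rng = np.random.default_rng(6)
wide_x = rng.normal(size=(500, 2)) * np.array([5.0, 1.0])
wide_x_again = rng.normal(size=(500, 2)) * np.array([5.0, 1.0])
wide_y = rng.normal(size=(500, 2)) * np.array([1.0, 5.0])
dist_same = principal_direction_distance(wide_x, wide_x_again)
dist_rotated = principal_direction_distance(wide_x, wide_y)
print(dist_same, dist_rotated)
```

Note that this check only sees changes in the direction of largest variance. A pure shift in the mean leaves it unchanged.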

PCA Based Dimension Reduction

Reduce to lower dimensional space with PCA. Treat each dimension as independent univariate data. Apply distribution difference with KL divergence or any other univariate detection technique for each dimension.
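The sketch below fits PCA on the reference data, projects both data sets, and applies the KS test from SciPy to each component. Fitting on the reference data only, the number of components and the Bonferroni style correction of the significance level are all assumptions of this example.

```python
import numpy as np
from scipy.stats import ks_2samp


def pca_ks_drift(reference, recent, n_components=2, alpha=0.01):
    """Project on the reference principal components, then test each one."""
    reference = np.asarray(reference, dtype=float)
    recent = np.asarray(recent, dtype=float)
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    components = vt[:n_components]
    ref_proj = (reference - mean) @ components.T
    rec_proj = (recent - mean) @ components.T
    pvalues = [float(ks_2samp(ref_proj[:, j], rec_proj[:, j]).pvalue)
               for j in range(n_components)]
    # divide alpha by the number of tests to limit false alarms
    drift = any(p < alpha / n_components for p in pvalues)
    return pvalues, drift


rng = np.random.default_rng(7)
scales = np.array([3.0, 2.0, 1.0, 1.0, 1.0])
reference = rng.normal(size=(500, 5)) * scales
recent = rng.normal(size=(500, 5)) * scales
recent[:, 0] += 2.0   # shift the highest variance feature
pvalues, drift = pca_ks_drift(reference, recent)
print(pvalues, drift)
```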

AutoEncoder Based Dimension Reduction

Same as the previous technique, except that AutoEncoder is used. It’s more powerful, because AutoEncoder can perform non linear dimension reduction.

Density Difference (LSDD-CDT)

Based on the sum of the squared probability density differences. A Gaussian kernel is used to approximate the density difference.

Local Drift Degree (LDD-DIS)

Identifies local drift by estimating the local density change using the KNN algorithm. Different subspaces are considered. In any subspace, the ratio of the number of reference data points in that subspace to the reference sample size is taken. The same ratio is calculated for the subspace with the current sample. When there is drift in the subspace, there is a significant difference between the 2 ratios.

Virtual Classifier (VC)

Create an artificial label. Assign label 1 to one data set and -1 to the second data set. Build a binary classifier on the combined data with k fold cross validation. If the test accuracy is around 0.5, there is no drift. If it’s significantly higher than 0.5, drift is present.
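A minimal sketch with scikit-learn. The choice of logistic regression as the binary classifier and the number of folds are assumptions of this example; any classifier can play this role.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def virtual_classifier_accuracy(reference, recent, folds=5):
    """Cross validated accuracy of telling the two data sets apart."""
    features = np.vstack([reference, recent])
    labels = np.concatenate([np.ones(len(reference), dtype=int),
                             -np.ones(len(recent), dtype=int)])
    model = LogisticRegression(max_iter=1000)
    scores = cross_val_score(model, features, labels, cv=folds)
    return float(scores.mean())


rng = np.random.default_rng(8)
reference = rng.normal(size=(400, 3))
same = rng.normal(size=(400, 3))
shifted = rng.normal(size=(400, 3)) + 1.5
acc_same = virtual_classifier_accuracy(reference, same)
acc_shift = virtual_classifier_accuracy(reference, shifted)
print(acc_same, acc_shift)
```

A linear classifier like this one only picks up drift that makes the two data sets linearly separable, so a non linear classifier may be a better choice for other kinds of change.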

Nearest Neighbor Count (NNC)

Artificially assign labels 1 and -1 to the 2 data sets. For each data point from the two samples, count how many of its k nearest neighbors belong to the same class. Normalize the count by dividing the total by the number of samples, which gives an average count per data point. If the average count is high, drift is present. Assuming the two data sets are of the same size, the average count will be about k/2 when there is no drift. As drift takes hold, the count will approach k.
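The count can be sketched directly with NearestNeighbors from scikit-learn. This is an illustration of the idea, not the beymani implementation; the function name and the value of k are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def nearest_neighbor_count(reference, recent, k=10):
    """Average number of same class points among the k nearest neighbors."""
    data = np.vstack([reference, recent])
    labels = np.concatenate([np.ones(len(reference), dtype=int),
                             -np.ones(len(recent), dtype=int)])
    # ask for k + 1 neighbors because each point is its own nearest neighbor
    neighbors = NearestNeighbors(n_neighbors=k + 1).fit(data)
    _, indices = neighbors.kneighbors(data)
    neighbor_labels = labels[indices[:, 1:]]   # drop the point itself
    same_class = (neighbor_labels == labels[:, None]).sum(axis=1)
    return float(same_class.mean())


rng = np.random.default_rng(9)
reference = rng.normal(size=(500, 3))
same = rng.normal(size=(500, 3))
shifted = rng.normal(size=(500, 3)) + 5.0
count_same = nearest_neighbor_count(reference, same)
count_shift = nearest_neighbor_count(reference, shifted)
print(count_same, count_shift)
```

Dropping the first neighbor assumes there are no exact duplicate points, which holds for continuous features like the ones generated here.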

Drift in Customer Behavior Data with Nearest Neighbor Count

Consider an eCommerce company that has deployed a customer churn prediction model. The features consist of average customer behavior over the past 6 months. The model predicts the probability of a customer churning within the next 1 month. We will estimate drift in customer behavior data using the Nearest Neighbor Count algorithm. Here is the list of features:

average transaction amount
average time gap between visits
average visit duration
average num of searches per session
average number of service issues
average num of calls or emails for issue resolution
average num of online payment issues

We are going to use the KNN classifier from the scikit-learn Python library. We will use a Python wrapper class that makes it easier to train and use the KNN classifier model. All you have to do is edit a configuration file. The Python driver code can be used as an example. Here are the solution steps:

Assign an artificial class label (e.g. 1) to the first data set and another class label (e.g. 0) to the second data set.
Combine the 2 data sets and train a KNN classifier model. Since the KNN classifier is model free, there is no model building involved in the training.
If the average prediction probability is close to 0.5, i.e. poor predictability, there is no drift. As drift sets in, the average probability goes up.
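The three steps can be sketched with plain scikit-learn as shown below. This is not the wrapper class or driver code mentioned above. The hold out split, k = 10, the synthetic 7 feature data and the choice to average the larger of the two class probabilities are all assumptions of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def mean_prediction_probability(first, second, k=10, seed=0):
    """Mean of the larger predicted class probability on held out data."""
    # step 1: artificial labels, 1 for the first data set and 0 for the second
    features = np.vstack([first, second])
    labels = np.concatenate([np.ones(len(first), dtype=int),
                             np.zeros(len(second), dtype=int)])
    # step 2: combine and fit the KNN classifier on part of the data
    x_train, x_test, y_train, _ = train_test_split(
        features, labels, test_size=0.3, random_state=seed, stratify=labels)
    model = KNeighborsClassifier(n_neighbors=k).fit(x_train, y_train)
    # step 3: average the probability of the predicted class
    probabilities = model.predict_proba(x_test)
    return float(probabilities.max(axis=1).mean())


rng = np.random.default_rng(10)
first = rng.normal(size=(500, 7))
second_same = rng.normal(size=(500, 7))
second_drifted = rng.normal(size=(500, 7)) + 5.0
prob_no_drift = mean_prediction_probability(first, second_same)
prob_drift = mean_prediction_probability(first, second_drifted)
print(prob_no_drift, prob_drift)
```

Because this sketch averages the larger of the two class probabilities, its no drift baseline with k = 10 and equally sized data sets is roughly 0.62 rather than exactly 0.5.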

Here are some results for the no drift case. We get a mean prediction probability of 0.617, close enough to 0.50. Since the 2 data sets come from the same distribution, any neighborhood will contain a roughly equal number of data points from both sets.

.........
[0.8 0.2]
[0.6 0.4]
[0.8 0.2]
[0.3 0.7]
[0.5 0.5]
[0.6 0.4]
[0.5 0.5]
[0.6 0.4]
mean prediction probability 0.617

Next we are going to artificially create drift by making the second data set undergo a distribution shift. Distribution shift is induced by scaling and shifting the feature variables by small amounts.

Here is the result. The mean prediction probability has jumped to 0.980, strongly indicating drift in the data. As we have applied a distribution shift to the second data set, the two data sets have separated, making them more predictable. The average prediction probability will be a function of the amount of distribution shift, i.e. drift.

.......
[1. 0.]
[1. 0.]
[0.9 0.1]
[1. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
mean prediction probability 0.980
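Inducing a distribution shift by scaling and shifting can be sketched as below. The function name, the scale and shift amounts and the choice to express the shift in units of each feature's standard deviation are assumptions of this example, not the values used for the results above.

```python
import numpy as np


def induce_shift(data, scale=1.1, shift=0.2):
    """Scale each feature around its mean, then shift it by a fraction
    of its standard deviation."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return (data - mean) * scale + mean + shift * std


rng = np.random.default_rng(11)
original = rng.normal(loc=10.0, scale=2.0, size=(1000, 7))
drifted = induce_shift(original, scale=1.2, shift=0.5)
print(original.mean(axis=0)[:2], drifted.mean(axis=0)[:2])
```

Increasing scale and shift step by step and feeding the result to a drift detector shows how the detection score grows with the amount of drift.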

Please refer to the tutorial document for all the steps for executing this use case. Please feel free to make changes and experiment. For example, you can apply varying amounts of distribution shift and see how the prediction probability changes.

Wrapping Up

These algorithms are always applicable for detecting drift in unsupervised machine learning models. Unfortunately there is no such guarantee for supervised models. They will only work if there is dependency between P(x) and P(y|x).

For causal machine learning models, these two probabilities are independent. A supervised machine learning model is causal when the model is aligned with the underlying causal model, i.e. when the features are the causes and the response is the effect.

For deployed supervised machine learning models, if the actual outcome is readily available after prediction, then these techniques should not be used. Instead, supervised drift detection techniques should be used.