Time Series Hierarchical Clustering using Dynamic Time Warping in Python

Let us consider the following task: we have a bunch of evenly distributed time series of different lengths. The goal is to cluster time series by defining general patterns that are presented in the data.

Here I’d like to present one approach to solving this task. We will use hierarchical clustering and DTW algorithm as a comparison metric to the time series. The solution worked well on HR data (employee historical scores). For other types of time series, DTW function may work worse than other metrics like CID (Complexity Invariant Distance), MAE or correlation.

I will skip the theoretical explanations of hierarchical clustering and DTW algorithms and focus on why did I select such combination:

Hierarchical Clustering is simple, flexible, tunable (linkage criteria) and allows us not to cluster all trajectories
DTW method allows us to compare time series of different length and, by my experience, works perfectly with infrequent

Ok, here we go! Our imports:

import random
from copy import deepcopy
from scipy import interpolate
import numpy as np
from dtaidistance import dtwimport matplotlib.pyplot as pltfrom _plotly_future_ import v4_subplots
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Some parameters for time series generation and our threshold:

NUM_OF_TRAJECTORIES number of trajectories that we have to cluster
MIN_LEN_OF_TRAJECTORY and MAX_LEN_OF_TRAJECTORY lower and upper length bounds for any trajectory
THRESHOLD our threshold for DTW

NUM_OF_TRAJECTORIES = 200
MIN_LEN_OF_TRAJECTORY = 16
MAX_LEN_OF_TRAJECTORY = 40THRESHOLD = 0.50

For simplicity, all our trajectories will lie between -1 and 1. Also, I added some smoothing.

for item in range(NUM_OF_TRAJECTORIES):
   length = random.choice(list(range(MIN_LEN_OF_TRAJECTORY, MAX_LEN_OF_TRAJECTORY + 1)))
   tempTrajectory = np.random.randint(low=-100, high=100, size=int(length / 4)).astype(float) / 100
   
   oldScale = np.arange(0, int(length / 4))
   interpolationFunction = interpolate.interp1d(oldScale, tempTrajectory)
   
   newScale = np.linspace(0, int(length / 4) - 1, length)
   tempTrajectory = interpolationFunction(newScale)
   
   trajectoriesSet[(str(item),)] = [tempTrajectory]

Notice, that all trajectories are stored as dictionary values of list type (for convenience, when we will start to union them into groups). For the same reason, the names of trajectories are stored as tuples.

Our algorithm is the following:

We find a pair of closest entities (trajectory-trajectory or trajectory-cluster or cluster-trajectory or cluster-cluster)
Group them into a single cluster if their distance is lower than the THRESHOLD
Repeat step 1
We stop our algorithm if we fail at step 2 or we get one big cluster (so all our trajectories get into it — it means that our THRESHOLD is very big)

The first part of the algorithm:

trajectories = deepcopy(trajectoriesSet)
distanceMatrixDictionary = {}iteration = 1
while True:
   distanceMatrix = np.empty((len(trajectories), len(trajectories),))
   distanceMatrix[:] = np.nan
   
   for index1, (filter1, trajectory1) in enumerate(trajectories.items()):
      tempArray = []
      
      for index2, (filter2, trajectory2) in enumerate(trajectories.items()):
         
         if index1 > index2:
            continue
         
         elif index1 == index2:
            continue
         
         else:
            unionFilter = filter1 + filter2
            sorted(unionFilter)
            
            if unionFilter in distanceMatrixDictionary.keys():
               distanceMatrix[index1][index2] = distanceMatrixDictionary.get(unionFilter)
               
               continue
            
            metric = []
            for subItem1 in trajectory1:
               
               for subItem2 in trajectory2:
                  metric.append(dtw.distance(subItem1, subItem2, psi=1))
            
            metric = max(metric)
            
            distanceMatrix[index1][index2] = metric
            distanceMatrixDictionary[unionFilter] = metric

Dictionary distanceMatrixDictionary helps us to keep already calculated distances.

Numpy array distanceMatrix is filled with np.nan at the beginning of each step. It is needed only to keep representations between pairs of indexes and calculated distances. May be removed after adding the same functionality to the distanceMatrixDictionary.

This part of code allows us to compare all possible options — trajectory-trajectory or trajectory-cluster or cluster-trajectory or cluster-cluster :

metric = []
for subItem1 in trajectory1:
   
   for subItem2 in trajectory2:
      metric.append(dtw.distance(subItem1, subItem2))metric = max(metric)

The last line above — metric = max(metric) — is linkage criteria called ‘complete linkage’. It worked better for me but you can try other criteria or even make it customized.

Okay, distances are calculated, let us proceed with the grouping process.

We find the lowest distance and a pair of indexes that provide us this distance.

Here for simplicity, we will work with only one pair (the first one). Even, if we have two, three or more pairs for the same distance — the rest will be processed step-by-step during the next iterations.

minValue = np.min(list(distanceMatrixDictionary.values()))if minValue > THRESHOLD:
   breakminIndices = np.where(distanceMatrix == minValue)
minIndices = list(zip(minIndices[0], minIndices[1]))minIndex = minIndices[0]

After getting the pair of indexes we need simply define entities names and values, combine them, put the combination into the dictionary and remove these single entities from it:

filter1 = list(trajectories.keys())[minIndex[0]]
filter2 = list(trajectories.keys())[minIndex[1]]trajectory1 = trajectories.get(filter1)
trajectory2 = trajectories.get(filter2)unionFilter = filter1 + filter2
sorted(unionFilter)trajectoryGroup = trajectory1 + trajectory2trajectories = {key: value for key, value in trajectories.items()
                if all(value not in unionFilter for value in key)}distanceMatrixDictionary = {key: value for key, value in distanceMatrixDictionary.items()
                            if all(value not in unionFilter for value in key)}trajectories[unionFilter] = trajectoryGroup

After that, we repeat the previous step until we have nothing to cluster.

I have described the general approach but this algorithm can be simplified, boosted and modified to avoid any recalculations.

As a result, we get groups like this:

YbaQ3uq.png!web

In this cluster, we see 3 time series of different lengths. All of them have the same general pattern: local minimum in the first third, then global peak in the second half and a global minimum in the end.

Some more results (here for each cluster the left subplot presents original trajectories lengths, right one — rescaled to the MAX_LEN_OF_TRAJECTORY for comparison):

aQnu6ri.png!web

Cluster #1–2 items

ZvqaUb3.png!web

Cluster #2–2 items

m6B3uiF.png!web

Cluster #3–3 items

rU3iYzE.png!web

Cluster #4–3 items

ENbYzuM.png!web

Cluster #5–3 items

Depending on THRESHOLD value we can make our clusters bigger (and more generalized) or smaller (more detailed).

What can we improve if the current approach does not perform well on another dataset?

We can try to replace DWT with another distance metric
We can try to work additionally with time series: scale/rescale, smooth or remove outliers
We can try to use different thresholds
We can try to change linkage criteria

After finding the optimal hyperparameters it is possible to refactor the code and speed-up the calculations.

Code to use is available here:

https://github.com/avchauzov/_articles/blob/master/1.1.trajectoriesClustering.ipynb

Recommend

新消费时代的双11狂欢，聚划算如何重新定义下沉市场？

北京确认接诊两例肺鼠疫病例

想咨询大家一个工作方面的问题，这几天都整的特别焦虑。

现在咸鱼片子挺多的啊,吓人.

有什么单机游戏配置要求不高开着修改器玩虐虐电脑轻松愉快吗

是不是大多数后端都害怕 css？

在校生被欠薪资劳动局不受理怎么处理

Nokia N78，有多少人还记得这款手机？哪一款手机，让你至今无法忘怀？

如何利用缓存机制实现JAVA类反射性能提升30倍

Demystifying CSS Pseudo-Classes (:nth-child vs. :nth-of-type)

About Joyk