34

Time Series Hierarchical Clustering using Dynamic Time Warping in Python

 4 years ago
source link: https://towardsdatascience.com/time-series-hierarchical-clustering-using-dynamic-time-warping-in-python-c8c9edf2fda5?gi=4128df66b81a
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Let us consider the following task: we have a bunch of evenly distributed time series of different lengths. The goal is to cluster time series by defining general patterns that are presented in the data.

Here I’d like to present one approach to solving this task. We will use hierarchical clustering and DTW algorithm as a comparison metric to the time series. The solution worked well on HR data (employee historical scores). For other types of time series, DTW function may work worse than other metrics like CID (Complexity Invariant Distance), MAE or correlation.

I will skip the theoretical explanations of hierarchical clustering and DTW algorithms and focus on why did I select such combination:

  1. Hierarchical Clustering is simple, flexible, tunable (linkage criteria) and allows us not to cluster all trajectories
  2. DTW method allows us to compare time series of different length and, by my experience, works perfectly with infrequent

Ok, here we go! Our imports:

import random
from copy import deepcopy
from scipy import interpolate
import numpy as np
from dtaidistance import dtwimport matplotlib.pyplot as pltfrom _plotly_future_ import v4_subplots
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Some parameters for time series generation and our threshold:

  • NUM_OF_TRAJECTORIES number of trajectories that we have to cluster
  • MIN_LEN_OF_TRAJECTORY and MAX_LEN_OF_TRAJECTORY lower and upper length bounds for any trajectory
  • THRESHOLD our threshold for DTW
NUM_OF_TRAJECTORIES = 200
MIN_LEN_OF_TRAJECTORY = 16
MAX_LEN_OF_TRAJECTORY = 40THRESHOLD = 0.50

For simplicity, all our trajectories will lie between -1 and 1. Also, I added some smoothing.

for item in range(NUM_OF_TRAJECTORIES):
   length = random.choice(list(range(MIN_LEN_OF_TRAJECTORY, MAX_LEN_OF_TRAJECTORY + 1)))
   tempTrajectory = np.random.randint(low=-100, high=100, size=int(length / 4)).astype(float) / 100
   
   oldScale = np.arange(0, int(length / 4))
   interpolationFunction = interpolate.interp1d(oldScale, tempTrajectory)
   
   newScale = np.linspace(0, int(length / 4) - 1, length)
   tempTrajectory = interpolationFunction(newScale)
   
   trajectoriesSet[(str(item),)] = [tempTrajectory]

Notice, that all trajectories are stored as dictionary values of list type (for convenience, when we will start to union them into groups). For the same reason, the names of trajectories are stored as tuples.

Our algorithm is the following:

  1. We find a pair of closest entities (trajectory-trajectory or trajectory-cluster or cluster-trajectory or cluster-cluster)
  2. Group them into a single cluster if their distance is lower than the THRESHOLD
  3. Repeat step 1
  4. We stop our algorithm if we fail at step 2 or we get one big cluster (so all our trajectories get into it — it means that our THRESHOLD is very big)

The first part of the algorithm:

trajectories = deepcopy(trajectoriesSet)
distanceMatrixDictionary = {}iteration = 1
while True:
   distanceMatrix = np.empty((len(trajectories), len(trajectories),))
   distanceMatrix[:] = np.nan
   
   for index1, (filter1, trajectory1) in enumerate(trajectories.items()):
      tempArray = []
      
      for index2, (filter2, trajectory2) in enumerate(trajectories.items()):
         
         if index1 > index2:
            continue
         
         elif index1 == index2:
            continue
         
         else:
            unionFilter = filter1 + filter2
            sorted(unionFilter)
            
            if unionFilter in distanceMatrixDictionary.keys():
               distanceMatrix[index1][index2] = distanceMatrixDictionary.get(unionFilter)
               
               continue
            
            metric = []
            for subItem1 in trajectory1:
               
               for subItem2 in trajectory2:
                  metric.append(dtw.distance(subItem1, subItem2, psi=1))
            
            metric = max(metric)
            
            distanceMatrix[index1][index2] = metric
            distanceMatrixDictionary[unionFilter] = metric

Dictionary distanceMatrixDictionary helps us to keep already calculated distances.

Numpy array distanceMatrix is filled with np.nan at the beginning of each step. It is needed only to keep representations between pairs of indexes and calculated distances. May be removed after adding the same functionality to the distanceMatrixDictionary.

This part of code allows us to compare all possible options — trajectory-trajectory or trajectory-cluster or cluster-trajectory or cluster-cluster :

metric = []
for subItem1 in trajectory1:
   
   for subItem2 in trajectory2:
      metric.append(dtw.distance(subItem1, subItem2))metric = max(metric)

The last line above — metric = max(metric) — is linkage criteria called ‘complete linkage’. It worked better for me but you can try other criteria or even make it customized.

Okay, distances are calculated, let us proceed with the grouping process.

We find the lowest distance and a pair of indexes that provide us this distance.

Here for simplicity, we will work with only one pair (the first one). Even, if we have two, three or more pairs for the same distance — the rest will be processed step-by-step during the next iterations.

minValue = np.min(list(distanceMatrixDictionary.values()))if minValue > THRESHOLD:
   breakminIndices = np.where(distanceMatrix == minValue)
minIndices = list(zip(minIndices[0], minIndices[1]))minIndex = minIndices[0]

After getting the pair of indexes we need simply define entities names and values, combine them, put the combination into the dictionary and remove these single entities from it:

filter1 = list(trajectories.keys())[minIndex[0]]
filter2 = list(trajectories.keys())[minIndex[1]]trajectory1 = trajectories.get(filter1)
trajectory2 = trajectories.get(filter2)unionFilter = filter1 + filter2
sorted(unionFilter)trajectoryGroup = trajectory1 + trajectory2trajectories = {key: value for key, value in trajectories.items()
                if all(value not in unionFilter for value in key)}distanceMatrixDictionary = {key: value for key, value in distanceMatrixDictionary.items()
                            if all(value not in unionFilter for value in key)}trajectories[unionFilter] = trajectoryGroup

After that, we repeat the previous step until we have nothing to cluster.

I have described the general approach but this algorithm can be simplified, boosted and modified to avoid any recalculations.

As a result, we get groups like this:

YbaQ3uq.png!web

In this cluster, we see 3 time series of different lengths. All of them have the same general pattern: local minimum in the first third, then global peak in the second half and a global minimum in the end.

Some more results (here for each cluster the left subplot presents original trajectories lengths, right one — rescaled to the MAX_LEN_OF_TRAJECTORY for comparison):

aQnu6ri.png!web

Cluster #1–2 items

ZvqaUb3.png!web

Cluster #2–2 items

m6B3uiF.png!web

Cluster #3–3 items

rU3iYzE.png!web

Cluster #4–3 items

ENbYzuM.png!web

Cluster #5–3 items

Depending on THRESHOLD value we can make our clusters bigger (and more generalized) or smaller (more detailed).

What can we improve if the current approach does not perform well on another dataset?

  1. We can try to replace DWT with another distance metric
  2. We can try to work additionally with time series: scale/rescale, smooth or remove outliers
  3. We can try to use different thresholds
  4. We can try to change linkage criteria

After finding the optimal hyperparameters it is possible to refactor the code and speed-up the calculations.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK