TL;DR
This paper introduces a non-parametric classifier for time series classification that leverages topological signatures and Sinkhorn divergences, effectively distinguishing chaotic system states even with noise and unknown models.
Contribution
It proposes a novel approach combining persistent homology, kernel density estimates, and Sinkhorn divergences for robust, model-free time series classification.
Findings
Accurately discriminates between similar chaotic system states.
Robust performance in noisy conditions.
Effective in classifying long, complex signals.
Abstract
Distinguishing between classes of time series sampled from dynamic systems is a common challenge in systems and control engineering, for example in the context of health monitoring, fault detection, and quality control. The challenge is increased when no underlying model of a system is known, measurement noise is present, and long signals need to be interpreted. In this paper we address these issues with a new non parametric classifier based on topological signatures. Our model learns classes as weighted kernel density estimates (KDEs) over persistent homology diagrams and predicts new trajectory labels using Sinkhorn divergences on the space of diagram KDEs to quantify proximity. We show that this approach accurately discriminates between states of chaotic systems that are close in parameter space, and its performance is robust to noise.
| System | Model | Class 0 | Class 1 |
| Logistic | uniform | ||
| Henon | uniform | ||
| Lorenz | uniform | ||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Sinkhorn Divergence of Topological Signature Estimates for Time Series Classification
Colin Stephen
School of Computing, Electronics and Mathematics, Coventry University UK
Email: [email protected]
Abstract
Distinguishing between classes of time series sampled from dynamic systems is a common challenge in systems and control engineering, for example in the context of health monitoring, fault detection, and quality control. The challenge is increased when no underlying model of a system is known, measurement noise is present, and long signals need to be interpreted. In this paper we address these issues with a new non parametric classifier based on topological signatures. Our model learns classes as weighted kernel density estimates (KDEs) over persistent homology diagrams and predicts new trajectory labels using Sinkhorn divergences on the space of diagram KDEs to quantify proximity. We show that this approach accurately discriminates between states of chaotic systems that are close in parameter space, and its performance is robust to noise.
I Introduction
Automatic labelling of time series is a significant challenge in many scientific applications: predicting cardiac pathologies from electrocardiogram (ECG) traces, the imminent failure of engine components from vibration measurements, or the type of a star from its light spectrum variations for example. Features that are invariant under transformations such as nonlinear warping in the time domain are often important for effective predictions in these contexts because measurement sampling rates may differ between samples, the same physical processes may evolve at different rates for different samples, or the morphology of the signal during important events may be the deciding factor between classes. So a wide range of methods for time series classification based on different invariances are available, offering a variety of performance characteristics suited to different applications [1, 2]. In the case of dynamic systems, it is also known that signal decomposition and interpretation methods such as spectral and cepstral analysis, and phase space reconstructions using Takens embedding theorem, provide useful features to interpret and compare the states of systems [3, 4, 5].
The approach taken in this paper compares global descriptors of time series, like symbolic aggregate approximations (SAX) or bags of patterns (BOP) do, but without the need to decompose signals and using a metric that is invariant to nonlinear time domain transformations, like dynamic time warping (DTW) is. In particular we show how to summarise the topological features of classes of time series in a concise way, and how to quickly quantify similarity between the topology of an unlabelled time series and these class summaries.
Our method applies techniques from topological data analysis (TDA) in the form of persistent homology (PH) to characterize signals [6, 7]. The central object of interest in PH is the persistence diagram (PD) or barcode which provides a concise and stable formal representation of the topological features present in a data set at all metric scales simultaneously [8, 9]. This global multi-scale perspective allows TDA to expose features otherwise overlooked by conventional nonlinear dimensionality reduction techniques. For an introduction see [9], for algorithmic aspects [10], and for examples of metric space representations on the space of PDs needed to use TDA in off-the-shelf machine learning pipelines see [11, 12].
II Related Work
Time series have previously been analysed using TDA, largely via PH of point clouds constructed using delay embeddings [13, 14, 15, 16]. However delay embeddings require heuristic estimation of the embedding dimension and delay size to be computed beforehand, increase the influence of noise, and lead to a rapid increase in the complexity of computing PH of filtrations on the point cloud, requiring subsampling techniques such as the witness complex [17, 18, 19]. One previous study avoids embeddings but relies on a coarse grained statistic derived from PDs to characterise time series, the persistent entropy [20]. Our method is similar to this since we use a filtration on the time series directly, however we use a metric on the space of persistence diagrams directly rather than on the space of persistent entropy histograms.
Our method uses persistence images (PIs) as its underlying stable representation of PH, however rather than using these as feature vectors in a support vector classifier as in [11], we apply a distance metric to them. Following [21] we treat PIs as kernel estimators of the density of expected PDs for a class, and to measure distance between these density estimates and new PIs we use the Sinkhorn divergence [22]. The latter is an upper bound approximation to the Wasserstein distance between distributions which can be computed very quickly [23, 24, 25]. The Sinkhorn divergence has recently been applied to scalable clustering and averaging of large PDs [26], but not yet to classification using topological features.
Much previous work has been done on characterising and comparing trajectories of dynamical systems, but this often focuses on methods for distinguishing between chaotic and non-chaotic (periodic, quasi-periodic, intermittent) behaviour, for example by analysing spectra of Lyapunov exponents or textures of recurrence plots [27, 28]. The method we develop here is suited to this type of classification but to showcase its fine-grained capabilities we focus numerical experiments on showing that it can distinguish between different chaotic regimes that lie very close to one another in parameter space.
III Persistent Homology and Optimal Transport
We first outline persistent homology in terms of sub level sets of functions, then the persistence image representation of persistence diagrams, before introducing entropy regularized optimal transport metrics between probability distributions. These topics form the backbone of the classifier pipeline defined and applied in following sections.
III-A Persistent Homology and Persistence Images
Given a bounded continuous function on a topological space define sublevel sets for each in . Then given the inclusion induces a homomorphism of homology groups: for each dimension . Under mild conditions on , for any the homomorphism is not an isomorphism for only finitely many values of for all , and is finitely generated [29]. This guarantees that the following procedure results in a finite data structure: step through the values of at which the homology of changes and record the maximal intervals such that homology classes appearing in some live in precisely one interval and no where else. The filtration values describing each such interval are often called the birth and death values of the corresponding topological feature in the filtration. The translation between a one dimensional space and its sub level set (birth, death) pairs is particularly easy to visualize (Fig.1).
Persistent homology became an effective way to quantify changes in topology across scales when it was recognised that by adding points in with countably infinite multiplicity to the multiset of (birth, death) pairs arising from the process above, the resulting persistent homology barcode or persistence diagram (PD) is a stable representation under perturbations to [29]. In other words there is then a natural metric on PDs bounded above by some distance between the underlying functions [31, 9, 32]. Frequently this stability is stated as a bound on the -Wasserstein distance between sublevel set PDs of Lipschitz functions with respect to the metric:
[TABLE]
where diagrams are PDs for , the constants are independent of , and ranges over all bijections between the countably infinite multisets (see [8] for details). Because of this the Wasserstein distance has been at the heart of applications of TDA as it has developed. However due to the complexity of optimizing over bijections [33] and because it imposes a complicated geometry on the space of diagrams [34] there has been a glut of alternative vector space representations.
One such representation is the discrete persistence image (PI) which in turn is defined in terms of a continuous persistence surface (PS) arising from a PD [11]. The latter is a way to estimate the distribution from which a diagram is sampled while ensuring that the estimate itself is stable. Given a diagram as defined above, note that the persistence value is at least zero for all points so we can translate to the positive quadrant via . Next consider a weighted sum of Gaussians centered at points :
[TABLE]
where is a continuous, piecewise differentiable function that decays to zero on the horizontal axis. This is the persistence surface associated to . It is a stable estimate of the distribution underlying given the constrains on (see [11, 21] for details). To provide a vector space representation of suitable for use in machine learning applications, a finite regular grid is then placed over part of and used to quantize as a collection of pixels , corresponding to each cell . This collection of pixels is the persistence image representation of and it is a suitable vector representation for off-the-shelf classifiers such as support vector machines (SVMs) to use [11].
III-B Entropy Regularized Discrete Optimal Transport
The question of optimizing transport costs between distributions of resources has been studied in various forms since the 1700s, and in its modern guise is central to statistical learning theory [35, 36]. Given two -bin histograms , a transport plan between and is a matrix satisfying and . Equivalently, is a joint probability for two multinomial random variables and taking values in , whose marginals are and . We write for the set of all transport plans for -bin histograms , dropping the subscript when it is clear from context.
Given a transport plan , we interpret as a mass to be transported from the -th component of to the -th component of . If the cost of this operation is per unit of mass transported, then the discrete optimal transport (OT) problem is to minimise the sum of transport costs over all possible plans for given :
[TABLE]
where is the Frobenius product . The function is a metric on the space of histograms when is itself a metric distance matrix [37, 35] but computing it exactly is difficult in practice, with the worst case time complexity being reached with certain values of and [38].
Recent work showed that regularizing the classical problem by adding a convex constraint can lead to fast approximations [23, 39, 24, 22, 40, 25]. Define the entropy of a transport plan as , then minimising the sum of transport costs over high entropy transport plans as in
[TABLE]
where
[TABLE]
gives an upper bound approximation to that has complexity [23].111Different penalty functions lead to approximations with different convergence characteristics [24], but we consider only entropy in this paper. This upper bound is called the Sinkhorn divergence between the histograms since it has a natural parallel implementation based on iterated matrix-vector products known as the Sinkhorn Knopp (SK) algorithm [22]. Early evidence suggests that it gives better classification results than in a number of experiments and that it converges very quickly in practice [23]. Moreover, when the cost matrix is highly structured as is the case for distances on regular grids, the matrix operations of the algorithm can be speeded up further via FFT based convolutions [25]. This speed and accuracy advantage has increased interest in Sinkhorn divergence and other regularized variants of for a variety of problems ranging from color transfer in image processing to model optimization in machine learning [24, 41]. In the following sections we show that it can be integrated successfully within classification pipelines based on topological features as well.
IV Sinkhorn Divergence of Topological Signature Estimates
The problem we address now is to classify time series generated by deterministic dynamical systems. Suppose we are given time series samples from two classes corresponding to a choice of parameters in a single dynamical model. Can we characterize the data in terms of their shared or distinct topological properties without resorting to embeddings or parametric methods, and use this to effectively predict the class labels of new series? Represent a dataset containing samples each of length using an array , and represent class labels using a 0-1 vector .222The model we define is equally valid for classifying time series of different lengths and with more than two classes, with minor adjustments.
Training
- T1
For each sequence directly compute its sublevel set persistence diagram . 2. T2
Partition the set of PDs according to their associated class labels in , giving two sets of PDs: and its compliment . 3. T3
For each of and overlay the points in its member diagrams to form a combined persistence diagram representing the whole class: and . 4. T4
Choose a continuous and piecewise differentiable function such that for all , and a smoothing radius . Construct the smoothed persistence surfaces and . 5. T5
Choose a square grid that extends beyond the largest values of and in . Compute the persistence images and over the cells of .
In practice the form of , the value , and the size of can all be set at this stage using cross validation on the training data. After stage T5 we have for each class a stable kernel estimate of the density of its expected persistence diagram, which naturally leads to the following prediction pipeline.
Prediction
- P1
Given an unlabelled query sequence compute its persistence image using the pipeline above but skipping the diagram overlay steps T2 and T3, and reusing the same values for , and chosen in T4. 2. P2
Choose a value for an underlying metric on the grid , which induces a cost matrix on . Also choose a regularization parameter for computing Sinkhorn divergences . Compute the Sinkhorn divergences
[TABLE] 3. P3
If then predict , if then predict , else predict or with equal probability.
In practice the values of and can both be optimized using cross validation during the training phase.
Thus our model predicts labels for new time series based on the closest expected persistence diagram for each class in the training set, using the entropy regularized optimal transport distance between the distributions.
Implementation
Computing the sub level set persistence of each time series at stages T1 and P1 requires determining its critical points (local maxima and minima) and also noting for each local maximum which of its two neighbouring minima is closer in value. Thus the critical points must be sorted as part of the process, which is an operation at worst and for each class.
To compute the values of the persistence image pixels and for various numerical integration and approximation methods are available. In particular if we assume that each point appearing in a cell is centered in that cell we can approximate the persistence surfaces and by convolving their underlying -weighted histograms with a discrete filter corresponding to our chosen Gaussian. This allows us to compute the persistence images generated by T4 and T5 in a single step, in for our grid.
Finally during prediction, computing the regularized optimal transport cost between two -bin histograms for a cost matrix corresponding to distances on the grid is . This is because is a block Toeplitz of Toeplitz blocks (BTTB) matrix in this situation, meaning the matrix-vector products appearing in the Sinkhorn-Knopp (SK) algorithm used to compute can be computed using FFT enhanced convolutions. See [23, 22] for details of the SK algorithm and in paricular Chapter 5 of [42] for details of how to speed up the matrix-vector operations.
The result is that once the size of the grid has been set during training the complexity of the model is in time series length.
V Classification Experiments333Sklearn-compatible Python code implementing the classifiers described here can be found at https://github.com/colinstephen/icmla2018
We call the method above ‘Persistence Image Classification using Regularized Optimal Transport’, or PICROT for short. This section assesses its performance against one classifier using signal frequency and rate of change analysis, and one classifier based on persitent entropy as defined and successfully applied to similar problems in [43].444General purpose time series classifiers such as those benchmarked in [1] do not seem to perform well for the dynamic systems considered here. Initial results using dynamic time warping (DTW) and random forests were not competitive in terms of accuracy, while the potentially high-performance collective of transformation ensembles (COTE) and the elastic ensemble (EE) methods were too slow to evaluate due to the lengths of time series used here.
- C
PICROT compares kernel estimates of PD densities using the Sinkhorn divergence. We fix a weight function for PICROT that increases rapidly from zero to one in an interval less than the persistence value of any off diagonal points processed. Thus in effect we apply uniform weights when constructing persistence images. The smoothing parameter and grid size for PIs, and the regularization parameter for the Sinkhorn metric, are all estimated using 5-fold cross validation over a grid of candidate values during training. 2. C
CEPS is the one nearest neighbor classifier using Euclidean distance between coefficients of the discrete cosine transforms of the cepstra of the time series being compared:
[TABLE]
where
[TABLE]
is the Fourier transform, and the sum is over all coefficients 3. C
PENT is the one nearest neighbor classifier using the absolute difference between persistent entropies [20]. If is an indexed set of the off-diagonal points in the persistence diagram associated to , is the persistence of each point and is the total persistence of the diagram, then
[TABLE]
where
[TABLE]
CEPS is not commonly used as a general time series similarity measure but it is effective when the series have an underlying regularity or cyclicity [44, 5] as with the problems here. The measure captures information about the relative rates of change of the two signals across their frequency bands. Cross validating the number of leading terms compared in the cepstral classifier did not improve results, so we compare the entire cepstra. On the other hand compares signals using a coarse grained statistic derived from their persistence diagrams. We also use one nearest neighbour here because it shows higher accuracy than the receiver operating characteristic (ROC) optimized threshold approach originally appearing in [20].
Data
Binary time series classification problems were generated using the combinations of parameter values for the systems defined in Table I (13 parameter value combinations). The parameter ranges were chosen to ensure that all trajectories studied here are chaotic. Initial conditions and parameter values themselves were varied uniformly in the given intervals to ensure a wide variety of trajectories were observed.
In the case of the Henon and Lorenz systems only the values were used. In all cases the raw time series were normalized before any training or predictions and the first 1,000 sequence values were discarded. The subsequent lengths used for classification varied from 2,500 to 15,000 in steps of 2,500 (6 lengths total). White noise was added to each sample varying in standard deviation from 0.0 (no noise) to 0.75 in steps of 0.125 (7 noise levels in total). Thus there were 546 classification experiment configurations in total. Two example time series snippets from different classes in the fourth Henon configuration are shown in Figure 2.
For each of the 546 configurations 200 samples balanced between the two classes were generated, giving a total of 109,200 time series to be processed on each run. The data for each experiment were split randomly in to 170 training and 30 test time series to be classified. This process was then repeated for 10 runs per experiment, using different random train-test splits each time.
Results
The classifier PICROT outperformed the benchmarks in the vast majority of the experiments, as shown in Figure 4. In particular it tended to maintain higher relative accuracy of predictions against the benchmarks as noise levels were increased. This is most easily seen in Figure 3 where we visualize typical accuracy profiles. In the case of the Lorenz system experiments, PICROT was much more accurate than the PENT classifier, outperformed CEPS on shorter sequences, and matched its performance on longer ones. In experiments generated by the Henon system PICROT outperformed both benchmarks for almost all combinations of sequence length and noise level when the parameter was varied. This is clearly illustrated in the almost completely white grayscale maps representing the rank performance of PICROT against both of its competitors in Figure 4. Another trend visible in Figure 4 is that the sweet spot for accuracy of PICROT outperforming the others is at mid-range levels of noise (around a standard deviation of 0.375) and mid-range series lengths of around 7500 to 10000 points. Nevertheless white areas constitute over 85% of these maps on average, suggesting that PICROT is a strong performer in a wide variety of situations.
One additional benefit of using PICROT on time series is that it is possible to visualise the topological signatures of the training data classes. This helps to gain an insight into whether or not this topological feature is a ‘strong candidate’ for discriminating classes in a given situation. For example in Figure 2 we visualise the difference in PD class densities associated to the generated time series snippets. The structure and clear separation of locations of the ‘plumes’ in the third diagram suggest that the estimated PD density is likely to be a good feature in this case, while a more diffuse pattern may indicate that hyperparameters of the model may need to be tuned further or that noise is too dominant.
VI Conclusion
We defined a stable and scalable metric on the space of persistence diagrams based on entropic smoothing of optimal transport distances. Following [11] our approach makes use of weight functions applied to kernel density estimates on PDs to ensure our representations are 1-Wasserstein stable, but we treat the resulting quantized kernel density estimates as two dimensional histograms after renormalization. This allows direct application of regularized transport methods [22, 25].
Unlike existing topological methods for time series PICROT avoids the complexity explosion and noise amplification associated with high dimensional reconstructions based on Takens’ embedding theorem. In contrast to traditional signal decomposition methods noise removal is not required for the method to perform well. After cross validation during training the prediction compexity of the model is with respect to time series length, so it can be applied to long sequences effectively. In addition the underlying metric has a natural parallel implementation (the Sinkhorn Knopp algorithm) suitable for GPUs meaning it is highly scalable. To illustrate these benefits we conducted experiments classifying trajectories of chaotic deterministic systems under a range of signal to noise ratios and parameter values, finding that PICROT is more accurate than two specialized benchmarks in the majority of situations.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh, “The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances,” Data Mining and Knowledge Discovery , vol. 31, no. 3, pp. 606–660, May 2017. [Online]. Available: http://dx.doi.org/10.1007/s 10618-016-0483-9
- 2[2] P. Esling and C. Agon, “Time-series data mining,” ACM Computing Surveys (CSUR) , vol. 45, no. 1, pp. 1–34, 2012.
- 3[3] H. Kantz and T. Schreiber, Nonlinear time series analysis . Cambridge university press, 2004, vol. 7.
- 4[4] L. H. Koopmans, The spectral analysis of time series . Elsevier, 1995.
- 5[5] R. B. Randall, “A history of cepstrum analysis and its application to mechanical problems,” Mechanical Systems and Signal Processing , vol. 97, pp. 3–19, 2017.
- 6[6] G. Carlsson, “Topology and data,” Bulletin of the American Mathematical Society , pp. 1–49, 2009. [Online]. Available: http://www.ams.org/journals/bull/2009-46-02/S 0273-0979-09-01249-X/
- 7[7] H. Edelsbrunner and D. Morozov, “Persistent Homology : Theory and Practice,” 2014.
- 8[8] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko, “Lipschitz Functions Have L p subscript 𝐿 𝑝 L_{p} -Stable Persistence,” Foundations of Computational Mathematics , vol. 10, pp. 127–139, 2010.
