# Detection and evaluation of clusters within sequential data

**Authors:** Alexander Van Werde, Albert Senen-Cerda, Gianluca Kosmella, Jaron Sanders

PMC · DOI: 10.1007/s10618-025-01140-4 · Data Mining and Knowledge Discovery · 2025-08-14

## TL;DR

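This paper tests recently developed clustering algorithms for Block Markov Chains on real-world sequential data (animal GPS tracks, human DNA, English text, and daily financial yields). It finds that the low-dimensional representations they extract encode the sequential structure of the data and yield new insights into the underlying processes.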

## Contribution

First field study applying new clustering algorithms for Block Markov Chains to real-world sequential data.

## Key findings

- The algorithms successfully encode sequential structure in four kinds of data: GPS coordinates of animal movement, strands of human DNA, English texts, and daily yields in a financial market.
- Low-dimensional representations enable new insights into the underlying complex processes.
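- Useful low-dimensional representations can be extracted even when the real-life sequential data is sparse and high-dimensional, a setting where the algorithms previously had guarantees only on synthetic data.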

## Abstract

Sequential data is ubiquitous: it is routinely gathered to gain insights into complex processes such as behavioral, biological, or physical processes. Challengingly, such data not only has dependencies within the observed sequences, but the observations are also often high-dimensional, sparse, and noisy. These difficulties obscure the inner workings of the complex process under study. One solution is to calculate a low-dimensional representation that describes (characteristics of) the complex process. This representation can then serve as a proxy to gain insight into the original process. However, uncovering such a low-dimensional representation within sequential data is nontrivial due to the dependencies, and an algorithm specifically made for sequences is needed to guarantee estimator consistency. Fortunately, recent theoretical advancements on Block Markov Chains have resulted in new clustering algorithms that can provably do just this in synthetic sequential data. This paper presents a first field study of these new algorithms on real-world sequential data: a wide empirical study of clustering within a range of data sequences. We investigate broadly whether, when given sparse high-dimensional sequential data of real-life complex processes, useful low-dimensional representations can in fact be extracted using these algorithms. Concretely, we examine data sequences containing GPS coordinates describing animal movement, strands of human DNA, texts from English writing, and daily yields in a financial market. The low-dimensional representations we uncover are shown not only to encode the sequential structure of the data successfully, but also to enable new insights into the underlying complex processes.

The online version contains supplementary material available at 10.1007/s10618-025-01140-4.
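
To make the idea of a low-dimensional representation concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the usual Block Markov Chain model from the theoretical literature, which this summary does not spell out: the states are partitioned into clusters, the next cluster is drawn according to a small cluster-level transition matrix, and the next state is drawn uniformly within that cluster. The function names `simulate_bmc` and `cluster_transition_matrix`, the matrix `TRUE_BLOCKS`, and all parameter values are hypothetical. The cluster labels are taken as known here, whereas the algorithms studied in the paper estimate them from the sequence.

```python
import random
from collections import Counter


def simulate_bmc(block_matrix, cluster_sizes, length, seed=0):
    """Simulate a Block Markov Chain; return (sequence, labels).

    Assumed model: from a state in cluster a, pick the next cluster b with
    probability block_matrix[a][b], then a state uniformly within cluster b.
    """
    rng = random.Random(seed)
    # States are numbered 0..n-1; consecutive ranges form the clusters.
    members, labels = [], []
    for k, size in enumerate(cluster_sizes):
        start = len(labels)
        members.append(list(range(start, start + size)))
        labels.extend([k] * size)

    state = rng.randrange(len(labels))
    sequence = [state]
    for _ in range(length - 1):
        row = block_matrix[labels[state]]
        next_cluster = rng.choices(range(len(row)), weights=row)[0]
        state = rng.choice(members[next_cluster])
        sequence.append(state)
    return sequence, labels


def cluster_transition_matrix(sequence, labels, num_clusters):
    """Collapse state-level transitions into a row-normalised K x K matrix."""
    counts = Counter(
        (labels[x], labels[y]) for x, y in zip(sequence, sequence[1:])
    )
    matrix = []
    for a in range(num_clusters):
        total = sum(counts[(a, b)] for b in range(num_clusters))
        matrix.append(
            [counts[(a, b)] / total if total else 0.0 for b in range(num_clusters)]
        )
    return matrix


# Hypothetical example: 50 states in 2 clusters, summarised by a 2 x 2 matrix.
TRUE_BLOCKS = [[0.9, 0.1], [0.3, 0.7]]
sequence, labels = simulate_bmc(TRUE_BLOCKS, cluster_sizes=[30, 20], length=20000, seed=1)
estimate = cluster_transition_matrix(sequence, labels, num_clusters=2)
```

In this toy setting a 50 x 50 state-level transition matrix, most of whose entries would be estimated from very few observations, is replaced by a 2 x 2 matrix whose entries are each backed by thousands of transitions. That is the sense in which a cluster-level representation helps with sparse, high-dimensional sequences.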

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12354125/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12354125/full.md

## References

11 references — full list in the complete paper: https://tomesphere.com/paper/PMC12354125/full.md

---
Source: https://tomesphere.com/paper/PMC12354125