The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

Fionn Murtagh

arXiv:0805.2756·stat.ME·January 11, 2011·J. Classif.


TL;DR

This paper demonstrates that very high dimensional data tend to have simple hierarchical structures, which can be effectively characterized using ultrametric topology, with applications in time series segmentation.

Contribution

It introduces a formal measure of ultrametricity to quantify hierarchical structure in high-dimensional data, revealing their inherent simplicity.

Findings

01

Ultrametricity increases with dimensionality and sparsity.

02

High-dimensional data exhibit pervasive hierarchical structure.

03

Applications include time series segmentation and modeling.

Abstract

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.
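The idea of quantifying ultrametricity can be illustrated with a small sketch. A distance d is ultrametric when d(x, z) <= max(d(x, y), d(y, z)) for every triplet, which means every triangle is isosceles with a small base, or equilateral. The code below is not the paper's coefficient, which the abstract does not spell out. It is a simplified proxy: the fraction of randomly sampled triangles whose two longest sides agree within a relative tolerance. The names `ultrametric_fraction`, `uniform_cloud`, and `rel_tol` are hypothetical, and uniform random data is an assumption made only for illustration.

```python
import math
import random


def ultrametric_fraction(points, n_triplets=2000, rel_tol=0.1, seed=0):
    """Fraction of sampled triangles whose two longest sides are nearly equal.

    Simplified proxy for a degree of ultrametricity (an assumption,
    not the measure defined in the paper).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_triplets):
        a, b, c = rng.sample(points, 3)
        sides = sorted([math.dist(a, b), math.dist(b, c), math.dist(a, c)])
        # In an ultrametric space the two longest sides of any triangle
        # are equal; here equality is relaxed to a relative tolerance.
        if sides[2] - sides[1] <= rel_tol * sides[2]:
            hits += 1
    return hits / n_triplets


def uniform_cloud(n_points, dim, seed=0):
    """Generate n_points uniformly at random in the unit cube of given dim."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]


# Compare a low-dimensional and a high-dimensional point cloud.
low = ultrametric_fraction(uniform_cloud(100, 2))
high = ultrametric_fraction(uniform_cloud(100, 1000))
print(f"dim=2: {low:.2f}   dim=1000: {high:.2f}")
```

Under these assumptions, the fraction for the 1000-dimensional cloud should come out well above the 2-dimensional one, in line with the claim that ultrametricity becomes pervasive as dimensionality grows. The tolerance is a free parameter: a stricter `rel_tol` lowers both fractions, so only the comparison across dimensions is meaningful here.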
