MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions

Polina Turishcheva; Laura Hansel; Martin Ritzert; Marissa A. Weis; Alexander S. Ecker

arXiv:2410.16124·cs.LG·October 22, 2024

Open Access

TL;DR

MNIST-Nd provides a set of noisy datasets derived from MNIST, spanning 2 to 64 dimensions, to evaluate how clustering algorithms perform as dimensionality increases. It addresses a gap in existing benchmarks, which are mostly two-dimensional.

Contribution

The paper introduces MNIST-Nd, a novel synthetic dataset suite for benchmarking clustering performance across dimensions from 2 to 64.

Findings

01 The Leiden clustering algorithm shows robustness in high dimensions

02 Clustering performance degrades with increasing dimensionality

03 MNIST-Nd enables systematic study of dimensionality effects on clustering

Abstract

Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the…
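As a rough illustration of the kind of evaluation loop such a benchmark supports, the sketch below clusters labeled data at several dimensionalities and scores each result against the ground truth. Everything in it is an assumption made for illustration: the toy Gaussian data stands in for MNIST-Nd (it is not produced by a mixture variational autoencoder), plain k-means stands in for whichever clustering algorithms are being compared, the adjusted Rand index is one possible score, and the doubling grid of dimensions is a guess consistent with "2 to 64" and "six datasets". The scores it prints say nothing about the paper's results.

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def make_blobs(n_clusters, n_per_cluster, dim, spread, rng):
    """Toy stand-in data: isotropic Gaussian clusters (NOT MNIST-Nd)."""
    points, labels = [], []
    for k in range(n_clusters):
        center = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(n_per_cluster):
            points.append([c + rng.gauss(0.0, spread) for c in center])
            labels.append(k)
    return points, labels


def kmeans(points, k, rng, n_iter=10):
    """Plain Lloyd's k-means, used here only as a placeholder algorithm."""
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Move each non-empty center to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def comb2(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) / 2


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same samples."""
    sum_ij = sum(comb2(c) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb2(c) for c in Counter(a).values())
    sum_b = sum(comb2(c) for c in Counter(b).values())
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def benchmark(dims=(2, 4, 8, 16, 32, 64), n_clusters=10,
              n_per_cluster=20, spread=1.0, seed=0):
    """Score the placeholder algorithm once per dimensionality."""
    scores = {}
    for dim in dims:
        rng = random.Random(seed)
        points, labels = make_blobs(n_clusters, n_per_cluster, dim, spread, rng)
        predicted = kmeans(points, n_clusters, rng)
        scores[dim] = adjusted_rand_index(labels, predicted)
    return scores


if __name__ == "__main__":
    for dim, score in benchmark().items():
        print(f"dim={dim:2d}  ARI={score:.3f}")
```

To compare algorithms on real data, `make_blobs` would be replaced by a loader for the dataset at each dimensionality and `kmeans` by the method under test; the scoring loop stays the same.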

Taxonomy

Topics: Advanced Clustering Algorithms Research

Methods: Sparse Evolutionary Training