Measuring Dataset Diversity from a Geometric Perspective
Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan

TL;DR
This paper introduces a novel geometric approach using topological data analysis and persistence landscapes to measure dataset diversity, capturing structural richness beyond traditional entropy-based metrics.
Contribution
It presents a new diversity metric based on topological data analysis that effectively quantifies geometric and structural properties of datasets.
Findings
PLDiv, the proposed persistence-landscape-based diversity metric, correlates strongly with dataset complexity
The method is reliable across diverse data modalities
It offers interpretable insights into dataset structure
Abstract
Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful,…
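To make the idea of a persistence landscape concrete, the following is a minimal sketch, not the authors' PLDiv definition (which the abstract above does not give). It assumes only 0-dimensional homology (connected components) of a Vietoris-Rips filtration, where every component is born at 0 and the death times equal the edge lengths of a minimum spanning tree. The function names and the choice to summarize the landscapes by their total area are illustrative assumptions.

```python
import numpy as np


def h0_death_times(points):
    """H0 death times of a Vietoris-Rips filtration (= MST edge lengths, via Prim)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    if n < 2:
        return np.array([])
    # Pairwise Euclidean distances.
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest known edge from the tree to each point
    deaths = []
    for _ in range(n - 1):
        masked = np.where(in_tree, np.inf, best)
        j = int(np.argmin(masked))
        deaths.append(masked[j])  # two components merge at this scale
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return np.sort(np.array(deaths))


def landscapes(births, deaths, grid):
    """Row k holds the (k+1)-th landscape function sampled on `grid`."""
    # Each (birth, death) pair contributes a "tent" max(0, min(t - b, d - t)).
    tents = np.maximum(
        0.0,
        np.minimum(grid[None, :] - births[:, None], deaths[:, None] - grid[None, :]),
    )
    # Sorting each column in descending order gives lambda_1 >= lambda_2 >= ...
    return -np.sort(-tents, axis=0)


def landscape_area(points, num_grid=2001):
    """Illustrative summary (an assumption): total area under all H0 landscapes."""
    deaths = h0_death_times(points)
    if deaths.size == 0:
        return 0.0
    births = np.zeros_like(deaths)
    grid = np.linspace(0.0, deaths.max(), num_grid)
    total = landscapes(births, deaths, grid).sum(axis=0)
    # Trapezoidal rule written out by hand.
    return float(np.sum((total[1:] + total[:-1]) * np.diff(grid)) / 2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.normal(size=(50, 2))
    print("spread-out cloud:", landscape_area(cloud))
    print("tight cloud:     ", landscape_area(0.1 * cloud))
```

Because each tent with birth b and death d has area (d - b)^2 / 4, this summary grows with the squared merge scales of the point cloud, so a widely spread cloud scores higher than a tightly clustered copy of it. A full treatment would also include higher-dimensional features (loops, voids), which typically requires a dedicated TDA library.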
Taxonomy
Topics: Topological and Geometric Data Analysis · Cell Image Analysis Techniques · Morphological variations and asymmetry
