Topological Quality of Subsets via Persistence Matching Diagrams

Álvaro Torras-Casas; Eduardo Paluzo-Hidalgo; Rocio Gonzalez-Diaz

arXiv:2306.02411 · math.AT · October 1, 2024 · 2 cites

TL;DR

This paper introduces the persistence matching diagram, a topological data analysis invariant for evaluating how well a subset represents the dataset it is drawn from, and uses it to help explain the subset's likely effect on machine learning performance.

Contribution

It presents a novel topological invariant, together with an algorithm based on minimum spanning trees to compute it, for assessing the quality of a subset relative to the full dataset.

Findings

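01 Persistence matching diagrams indicate whether a subset represents the clusters of the larger dataset well.

02 The invariant can be used to estimate bounds for the Hausdorff distance between the subset and the complete dataset.

03 The approach helps explain why a chosen subset is likely to result in poor performance of a supervised learning model.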
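For reference, the quantity mentioned in the second finding can be computed directly on small examples. The sketch below is not the paper's bound estimate; it is the brute-force Hausdorff distance, under the assumption that the subset is contained in the dataset and that points are compared with the Euclidean metric. The function name `hausdorff_to_subset` and the toy points are hypothetical.

```python
import math

def hausdorff_to_subset(dataset, subset):
    """Hausdorff distance between `subset` and `dataset`, assuming subset ⊆ dataset.

    Every subset point is also a dataset point, so the directed distance from
    the subset to the dataset is zero. Only the other direction matters: the
    largest distance from a dataset point to its nearest subset point.
    """
    return max(min(math.dist(z, x) for x in subset) for z in dataset)

# Toy data (illustrative only): three points near the origin and one far away.
dataset = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
covers_both = [(0.0, 0.0), (5.0, 5.0)]   # one point near each group
misses_far = [(0.0, 0.0), (1.0, 0.0)]    # ignores the far point

d_good = hausdorff_to_subset(dataset, covers_both)
d_bad = hausdorff_to_subset(dataset, misses_far)
```

A subset that leaves part of the dataset uncovered has a large distance, which is the kind of defect the estimated bounds are meant to expose without this quadratic scan.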

Abstract

Data quality is crucial for the successful training, generalization and performance of machine learning models. We propose to measure the quality of a subset with respect to the dataset it represents, using topological data analysis techniques. Specifically, we define the persistence matching diagram, a topological invariant derived from combining embeddings with persistent homology. We provide an algorithm to compute it using minimum spanning trees. The invariant allows us to understand whether or not the subset "represents well" the clusters from the larger dataset, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. In particular, this approach enables us to explain why the chosen subset is likely to result in poor performance of a supervised learning model.
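The link between minimum spanning trees and persistent homology mentioned in the abstract rests on a standard fact: in the Vietoris-Rips filtration of a point cloud, the scales at which connected components merge are exactly the edge lengths of a minimum spanning tree. The sketch below illustrates only that building block with Kruskal's algorithm. It is not the authors' matching algorithm and does not compute the persistence matching diagram; the function name `h0_deaths`, the convention that an edge enters the filtration at its length, and the toy data are assumptions made for illustration.

```python
import math
from itertools import combinations

def h0_deaths(points):
    """Sorted merge scales of 0-dimensional persistence (MST edge lengths).

    Assumption: Euclidean metric, and an edge appears at its own length.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest (Kruskal).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # The edge joins two components, so one component dies here.
            parent[ri] = rj
            deaths.append(length)
    return deaths

# Toy data (illustrative only): two well-separated clusters on a line.
dataset = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
subset_both = [dataset[0], dataset[2]]   # one point from each cluster
subset_one = [dataset[0], dataset[1]]    # both points from the same cluster

deaths_full = h0_deaths(dataset)
deaths_both = h0_deaths(subset_both)
deaths_one = h0_deaths(subset_one)
```

In this toy case the full dataset has one long-lived merge scale separating its two clusters. The subset that samples both clusters keeps a comparably large scale, while the subset drawn from a single cluster shows only a short one. Comparing the two lists by eye is a crude stand-in for the matching that the paper defines.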

Code & Models

Repositories

cimagroup/tdqual (official)

Taxonomy

Topics: Topological and Geometric Data Analysis · Data Visualization and Analytics