Leveraging Data Symmetries to Select an Optimal Subset of Training Data under Label Noise

Kumar Shubham; Pavan Karjol; Kiran M K; Prathosh AP

arXiv:2605.01874 · cs.LG · May 5, 2026

TL;DR

This paper shows that exploiting known data symmetries and invariances improves k-NN-based selection of low-noise training subsets from noisy, high-dimensional datasets, which in turn makes models trained on the selected subset more robust to label noise.

Contribution

The paper formally links the performance of a classifier trained on a cutstats-selected subset of a noisy dataset to the accuracy of the underlying k-NN, and demonstrates that exploiting data invariances improves subset selection in high-dimensional settings.

Findings

01. Exploiting data invariance enhances k-NN performance in noisy environments.

02. Invariance knowledge helps identify near-optimal training subsets.

03. Learned invariant representations facilitate subset selection with partial invariance knowledge.
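
To make the first finding concrete, here is a toy sketch of why an invariant representation can help k-NN. It is an illustration under stated assumptions, not the paper's construction: the class is assumed to depend only on the distance from the origin (a rotation symmetry), and the names `knn_predict`, `polar`, and `radius_feature` are hypothetical helpers. The same query is classified once on raw coordinates and once on the rotation-invariant radius.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k nearest training points (Euclidean distance)."""
    nearest = sorted((math.dist(x, query), i) for i, x in enumerate(train_x))[:k]
    votes = Counter(train_y[i] for _, i in nearest)
    return votes.most_common(1)[0][0]

def polar(radius, degrees):
    """2-D point at the given radius and angle (hypothetical helper)."""
    t = math.radians(degrees)
    return (radius * math.cos(t), radius * math.sin(t))

def radius_feature(p):
    """Rotation-invariant representation: distance from the origin."""
    return (math.hypot(p[0], p[1]),)

# Assumed toy task: class 0 lies on the unit circle, class 1 on the circle of radius 2.
# The training set covers each class only at a few angles.
train_x = [polar(1.0, a) for a in (0, 10, 20)] + [polar(2.0, a) for a in (170, 180, 190)]
train_y = [0, 0, 0, 1, 1, 1]

# Query on the unit circle (true class 0), at an angle where only class 1 was observed.
query = polar(1.0, 180)

raw_pred = knn_predict(train_x, train_y, query)
inv_pred = knn_predict([radius_feature(p) for p in train_x], train_y, radius_feature(query))
print(raw_pred, inv_pred)  # 1 0
```

On raw coordinates the three nearest neighbors all belong to class 1, so the vote is wrong; on the radius feature the three class 0 points are nearest and the vote is correct. The radius is a hand-picked invariant for this toy symmetry; the third finding concerns the case where such a representation has to be learned.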

Abstract

The performance of machine learning models often relies on large labeled datasets; however, data collected from diverse sources can contain label noise. Recent work has shown that, in noisy settings, there may exist a subset of the training data on which models can achieve performance comparable to training on a noise-free dataset. A widely used method for identifying such subsets is cutstats, which employs k-nearest neighbors (k-NN) to detect low-noise samples. However, its performance on high-dimensional data remains largely unexplored. In this work, we formally establish that the performance of a classifier trained on a subset of a noisy dataset selected via cutstats is influenced by the accuracy of k-NN. We further demonstrate that, in noisy environments, exploiting data invariance and knowledge of underlying symmetries can significantly enhance the performance of k-NN, bringing it…
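
The idea of using k-NN to detect low-noise samples can be sketched as follows. This is a deliberately simplified neighbor label-agreement filter, not the cutstats statistic itself and not the authors' implementation: the functions `knn_agreement` and `select_low_noise`, the choice `k=4`, and the `0.5` threshold are all assumptions made for illustration. Each sample is scored by the fraction of its nearest neighbors that share its observed label, and only samples with high agreement are kept.

```python
import math

def knn_agreement(points, labels, k=4):
    """Fraction of each sample's k nearest neighbors that share its observed label."""
    scores = []
    for i, p in enumerate(points):
        # Brute-force neighbor search; distance ties are broken by index.
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        scores.append(sum(labels[j] == labels[i] for _, j in nearest) / k)
    return scores

def select_low_noise(points, labels, k=4, threshold=0.5):
    """Keep indices whose neighbor agreement exceeds the (assumed) threshold."""
    return [i for i, s in enumerate(knn_agreement(points, labels, k)) if s > threshold]

# Toy data: two well-separated 1-D clusters, 50 samples each.
points = [(float(x),) for x in range(50)] + [(float(x),) for x in range(100, 150)]
clean = [0] * 50 + [1] * 50
# Inject 20% label noise by flipping every fifth label.
observed = [1 - y if i % 5 == 0 else y for i, y in enumerate(clean)]

kept = select_low_noise(points, observed)
noise_before = sum(o != c for o, c in zip(observed, clean)) / len(clean)
noise_after = sum(observed[i] != clean[i] for i in kept) / len(kept)
print(len(kept), noise_before, noise_after)  # 80 0.2 0.0
```

In this toy setting the filter removes exactly the flipped samples because neighbors are reliable in one well-separated dimension. The quality of the selected subset depends directly on how accurate the neighbor vote is, which is the dependence on k-NN accuracy that the abstract describes and the reason high-dimensional data is the harder case.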
