Class-Proportional Coreset Selection for Difficulty-Separable Data

Elisa Tsai; Haizhong Zheng; Atul Prakash

arXiv:2507.10904 · cs.LG · August 15, 2025

Open Access

TL;DR

This paper introduces class-proportional coreset selection methods that account for data difficulty clustering by class, improving data pruning effectiveness and model robustness in domains such as network security and medical imaging.

Contribution

The paper formalizes class-difficulty separability, proposes class-proportional sampling strategies, and shows that they outperform class-agnostic methods across multiple datasets.

Findings

01. Data difficulty clusters by class in domains such as network intrusion detection and medical imaging.

02. Class-proportional methods outperform class-agnostic baselines.

03. Aggressive pruning improves generalization in noisy and imbalanced datasets.

Abstract

High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent…
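The contrast between class-agnostic and class-proportional selection can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy difficulty scores, and the choice to keep the highest-scoring samples are all assumptions made for illustration. The sketch only shows how a single global ranking can skew the class mix of a coreset when difficulty is separable by class, while a per-class budget preserves the original proportions.

```python
import random
from collections import Counter, defaultdict


def class_agnostic_select(scores, keep_frac):
    """Keep the globally highest-scoring samples, ignoring labels (hypothetical baseline)."""
    n_keep = round(keep_frac * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def class_proportional_select(scores, labels, keep_frac):
    """Keep the same fraction of every class, ranking by score within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    kept = []
    for idx in by_class.values():
        # Each class gets a budget proportional to its size (at least one sample).
        n_keep = max(1, round(keep_frac * len(idx)))
        ranked = sorted(idx, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:n_keep])
    return sorted(kept)


# Toy data (assumed): class 0 is uniformly "easy", class 1 uniformly "hard",
# so difficulty is perfectly separable by class.
rng = random.Random(0)
labels = [0] * 800 + [1] * 200
scores = [rng.uniform(0.0, 0.4) if y == 0 else rng.uniform(0.6, 1.0) for y in labels]

agnostic = class_agnostic_select(scores, keep_frac=0.2)
proportional = class_proportional_select(scores, labels, keep_frac=0.2)

agnostic_counts = Counter(labels[i] for i in agnostic)
proportional_counts = Counter(labels[i] for i in proportional)
print("class-agnostic:", dict(agnostic_counts))
print("class-proportional:", dict(proportional_counts))
```

In this toy setup the global ranking fills the entire budget from the hard class, while the per-class budget keeps the original 4:1 class ratio. Real difficulty scores overlap between classes, so the skew would typically be less extreme.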

Taxonomy

Topics: Face and Expression Recognition