Class-Proportional Coreset Selection for Difficulty-Separable Data
Elisa Tsai, Haizhong Zheng, Atul Prakash

TL;DR
This paper introduces class-proportional coreset selection methods that account for data difficulty clustering by class, improving data pruning effectiveness and model robustness in domains such as network intrusion detection and medical imaging.
Contribution
It formalizes class-difficulty separability, proposes class-proportional sampling strategies, and demonstrates their superiority over class-agnostic methods across multiple datasets.
Findings
Data difficulty often clusters by class in domains such as network intrusion detection and medical imaging.
Class-proportional methods outperform class-agnostic baselines.
Aggressive pruning improves generalization in noisy and imbalanced datasets.
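The contrast between class-agnostic and class-proportional selection can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a single per-sample difficulty score, and the choice to keep the hardest samples within each class are all assumptions made for illustration. The only idea taken from the summary above is that the pruning budget is split across classes in proportion to class size instead of being spent on one global ranking.

```python
import numpy as np


def class_agnostic_select(scores, keep_frac):
    """Keep the globally highest-scoring samples, ignoring labels.

    Hypothetical baseline: `scores` is any per-sample difficulty score.
    """
    scores = np.asarray(scores, dtype=float)
    n_keep = int(round(keep_frac * len(scores)))
    order = np.argsort(-scores, kind="stable")  # hardest first
    return np.sort(order[:n_keep])


def class_proportional_select(scores, labels, keep_frac):
    """Keep the same fraction of every class.

    Assumption: within a class we keep the hardest samples; other
    within-class rules could be swapped in without changing the budget split.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    kept = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)            # members of class c
        n_keep = max(1, int(round(keep_frac * len(idx))))
        order = np.argsort(-scores[idx], kind="stable")
        kept.append(idx[order[:n_keep]])
    return np.sort(np.concatenate(kept))
```

With six low-difficulty samples in one class and four high-difficulty samples in another, keeping half the data with the class-agnostic rule retains all four hard-class samples and only one easy-class sample, while the class-proportional rule retains three and two, preserving the original 60/40 class ratio.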
Abstract
High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent…
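The abstract is cut off before it defines CDSC, so the sketch below does not reproduce it. As a loosely related illustration only, it computes a generic between-class variance ratio (the eta-squared statistic from one-way ANOVA) over difficulty scores: values near 1 mean difficulty is almost entirely explained by class membership, and values near 0 mean classes have similar difficulty on average. The function name and the choice of this statistic are assumptions, not the paper's definition.

```python
import numpy as np


def between_class_variance_ratio(scores, labels):
    """Fraction of difficulty-score variance explained by class (eta-squared).

    Illustrative proxy only; this is NOT the paper's CDSC.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    grand_mean = scores.mean()
    total = ((scores - grand_mean) ** 2).sum()
    if total == 0.0:
        return 0.0  # all scores identical: nothing to separate
    between = 0.0
    for c in np.unique(labels):
        members = scores[labels == c]
        # class size times squared distance of the class mean from the grand mean
        between += len(members) * (members.mean() - grand_mean) ** 2
    return float(between / total)
```

A dataset whose score ranking changes little when labels are shuffled would give a low ratio under this proxy, which is the regime where a single global ranking and a per-class budget are expected to behave similarly.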
Taxonomy
Topics: Face and Expression Recognition
