ELFS: Label-Free Coreset Selection with Proxy Training Dynamics
Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R. Bartoldson, Bhavya Kailkhura, Atul Prakash

TL;DR
ELFS is a label-free coreset selection method that uses deep clustering to generate pseudo-labels and double-end pruning to choose which examples to send for human labeling, reducing annotation cost when training deep learning models.
Contribution
ELFS introduces a new approach combining deep clustering and double-end pruning to enhance label-free coreset selection performance.
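As a rough illustration of what double-end pruning could look like, the sketch below drops examples from both ends of a difficulty ranking. This page does not spell out the procedure, so the order of operations is an assumption: first discard a fraction `beta` of the hardest examples, then keep the hardest of what remains until the labeling budget is filled, which prunes the easy end. The names `double_end_prune`, `scores`, `budget`, and `beta` are hypothetical, and the convention that a higher score means a harder example is also assumed.

```python
def double_end_prune(scores, budget, beta):
    """Return indices of a coreset of size `budget`.

    scores: one difficulty value per example; higher = harder (assumed convention).
    budget: number of examples to keep for human labeling.
    beta:   fraction of the whole dataset, taken from the hard end, to drop first.
    """
    n = len(scores)
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")

    # Rank example indices from hardest to easiest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    # Hard end: drop the top beta fraction (assumed to be noisy or mislabeled).
    n_hard = int(beta * n)
    remaining = order[n_hard:]
    if budget > len(remaining):
        raise ValueError("budget exceeds the examples left after hard pruning")

    # Easy end: keep the hardest remaining examples, so the easiest are pruned.
    return remaining[:budget]


if __name__ == "__main__":
    toy_scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8, 0.4, 0.6, 0.0]
    print(double_end_prune(toy_scores, budget=3, beta=0.2))
```

In this toy run the two hardest examples (indices 0 and 6) are skipped and the next three hardest are kept. The reviews below note that `beta` was tuned by grid search and that its best value varies by dataset and sampling ratio, so any fixed value here is only a placeholder.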
Findings
ELFS outperforms state-of-the-art label-free methods on vision benchmarks.
ELFS achieves up to 10.2% accuracy improvement on ImageNet-1K.
ELFS effectively estimates data difficulty without ground truth labels.
Abstract
High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based difficulty scores. In this paper, we introduce ELFS (Effective Label-Free Coreset Selection), a novel label-free coreset selection method. ELFS significantly improves label-free coreset selection by addressing two challenges: 1) ELFS utilizes deep clustering to estimate training dynamics-based data difficulty…
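To make the idea of a training-dynamics difficulty score without ground-truth labels concrete, here is a minimal sketch. It assumes each example already has a pseudo-label (for instance, a cluster assignment) and that per-epoch logits from a model trained on those pseudo-labels were recorded. The score used is an average margin in the style of area under the margin (AUM); choosing that particular metric is an assumption made for illustration, since the truncated abstract does not name one. The function name `pseudo_label_margin` and its arguments are hypothetical.

```python
def pseudo_label_margin(logits_per_epoch, pseudo_label):
    """Average margin of the pseudo-label's logit over the largest other logit.

    logits_per_epoch: list of per-epoch logit vectors for ONE example.
    pseudo_label:     class index assigned by clustering (not a human label).
    A low or negative value suggests a hard example (assumed interpretation).
    """
    margins = []
    for logits in logits_per_epoch:
        assigned = logits[pseudo_label]
        # Largest logit among all classes other than the pseudo-label.
        largest_other = max(v for k, v in enumerate(logits) if k != pseudo_label)
        margins.append(assigned - largest_other)
    return sum(margins) / len(margins)


if __name__ == "__main__":
    # Two toy examples, both pseudo-labeled as class 0, tracked for two epochs.
    easy = [[2.0, 0.0, 0.0], [3.0, 1.0, 0.0]]
    hard = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.5]]
    print(pseudo_label_margin(easy, 0), pseudo_label_margin(hard, 0))
```

Because the labels come from clustering rather than annotators, a score like this is a proxy: it reflects how consistently the model agrees with the pseudo-label during training, not how well it matches the true class.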
Peer Reviews
Decision: ICLR 2025 Poster
1. ELFS effectively addresses the limitations of previous label-free coreset selection approaches, providing a feasible solution that leverages deep clustering for pseudo-labeling. 2. By employing double-end pruning, ELFS improves the selection of informative samples, achieving consistent performance improvements over baselines, even in challenging scenarios. 3. The evaluation across multiple datasets and pruning rates, along with an ablation study, showcases ELFS's flexibility and robustness,
1. The experiments involve numerous hyperparameters, optimized through grid search. A more in-depth analysis of the underlying reasons behind these optimal values would strengthen the understanding of how different parameters affect the measurement of sample difficulty, offering clearer insights into the importance of hard examples. 2. The approach heavily relies on feature extractors like SwAV and DINO for clustering. It remains unclear if using more advanced encoders, such as CLIP, could furt
1. It is an elegant and effective idea to estimate the data difficulty score through deep clustering. This addresses the challenge of measuring prediction uncertainty and sample difficulty without any human labels. 2. The proposed method is evaluated on multiple classification benchmarks, showing notable performance gains compared with state-of-the-art methods. The design of each module is well justified through ablation studies.
1. My major concern lies in the selection of hyper-parameter $\beta$. I can understand they require some grid search for hyper-parameters. However, according to Fig. 5, the optimal value is different for multiple datasets or sampling ratios, which is quite inefficient. For example, if there is a large dataset with millions of images, it is infeasible to do grid search on it. 2. Based on Tab. 7, it is quite strange that ResNet50 cannot outperform ResNet18 on the selected subset. I assume it reas
ELFS presents a compelling label-free coreset selection method that reduces the need for extensive and costly labeled datasets while achieving accuracy close to supervised methods. By effectively utilizing pseudo-labels, ELFS not only significantly outperforms other label-free baselines but also exhibits strong performance despite the inherent inaccuracies and noise associated with pseudo-labels. Moreover, the method demonstrates robustness and versatility, showing good transferability across di
The ELFS method is quite effective, but it mainly builds on familiar techniques like pseudo-labeling and coreset selection. This might make it seem less novel or groundbreaking to those familiar with the field. Despite this, it does a great job using these methods to ensure high accuracy and reliability. Moreover, to really show how well ELFS works and to expand its use, it would be beneficial to test it on a wider variety of datasets. This includes tackling larger and more complex datasets suc
Taxonomy
Topics: Rough Sets and Fuzzy Logic · Fuzzy Logic and Control Systems
Methods: Pruning
