UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective
Furui Xu, Shaobo Wang, Jiajun Zhang, Chenghao Sun, Haixiang Tang, Linfeng Zhang

TL;DR
UNSEEN is a dataset pruning framework that selects samples by how well models generalize to them rather than how well they fit them. This yields more effective coresets and substantial data reduction (30% on ImageNet-1K) without accuracy loss.
Contribution
UNSEEN is a plug-and-play, multi-step dataset pruning method that scores each sample using models that were not trained on it, which improves coreset quality and outperforms existing pruning methods.
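The idea of scoring samples with models that never saw them can be illustrated with a minimal cross-fold sketch. This is not the authors' implementation: the nearest-centroid "model", the score (one minus the predicted probability of the true class), the rule of keeping the highest-scoring samples, and all function names are assumptions chosen to keep the example self-contained, and the multi-step part of the method is not shown.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    # Hypothetical stand-in for training a scoring model: one mean vector per class.
    # Assumes every class appears in the training split.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def true_class_prob(centroids, X, y):
    # Softmax over negative squared distances to each class centroid,
    # then pick the probability assigned to each sample's true label.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return p[np.arange(len(y)), y]

def held_out_scores(X, y, n_folds=4, seed=0):
    # Each sample is scored only by a model fitted on the other folds,
    # so the score reflects generalization rather than fitting.
    rng = np.random.default_rng(seed)
    n_classes = int(y.max()) + 1
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = np.empty(len(y))
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        centroids = fit_centroids(X[train], y[train], n_classes)
        scores[held_out] = 1.0 - true_class_prob(centroids, X[held_out], y[held_out])
    return scores

def prune(scores, keep_fraction):
    # Assumed selection rule: keep the samples with the highest held-out scores.
    n_keep = int(round(keep_fraction * len(scores)))
    return np.sort(np.argsort(-scores)[:n_keep])

# Synthetic two-class toy data (not from the paper).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.repeat([0, 1], 100)

scores = held_out_scores(X, y)
kept = prune(scores, keep_fraction=0.7)
```

Here `kept` holds the indices of the retained samples. Any scoring metric could replace the one used above; the only point of the sketch is that the model producing a sample's score is fitted without that sample.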
Findings
Outperforms state-of-the-art pruning methods on CIFAR and ImageNet datasets.
Achieves 30% data reduction on ImageNet-1K without accuracy loss.
Demonstrates the effectiveness of generalization-based scoring in dataset pruning.
Abstract
The growing scale of datasets in deep learning has introduced significant computational challenges. Dataset pruning addresses this challenge by constructing a compact but informative coreset from the full dataset with comparable performance. Previous approaches typically establish scoring metrics based on specific criteria to identify representative samples. However, these methods predominantly rely on sample scores obtained from the model's performance during the training (i.e., fitting) phase. As scoring models achieve near-optimal performance on training data, such fitting-centric approaches induce a dense distribution of sample scores within a narrow numerical range. This concentration reduces the distinction between samples and hinders effective selection. To address this challenge, we conduct dataset pruning from the perspective of generalization, i.e., scoring samples based on…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
