Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images
Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

TL;DR
This paper introduces a benchmark and a new dataset compression framework, PCA, that focuses on image data and hard labels. The benchmark reveals that random subsets can perform competitively, which challenges the reliance on soft labels in large-scale dataset distillation.
Contribution
The paper presents a benchmark for fair comparison of dataset compression methods and proposes PCA, a framework emphasizing image data and hard labels, achieving state-of-the-art results.
Findings
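In the mainstream large-scale dataset distillation setting, which relies on soft labels from pre-trained models, randomly selected subsets perform competitively.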
Overreliance on soft labels may overlook the value of image data.
PCA achieves state-of-the-art performance while focusing on images and hard labels.
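The random-subset baseline mentioned in the findings can be illustrated with a minimal sketch. This is not the authors' code: the function name `random_subset_per_class`, the `seed` parameter, and the toy label list are hypothetical, and `ipc` is assumed to mean "images per class", as the released dataset names suggest. The sketch picks a fixed number of samples uniformly at random from each class and keeps their original hard labels.

```python
import random
from collections import defaultdict


def random_subset_per_class(labels, ipc, seed=0):
    """Return sorted indices of a random subset with at most `ipc` samples per class.

    Hypothetical helper for illustration only; `labels` is a list of
    hard (integer or string) class labels, one per image.
    """
    rng = random.Random(seed)  # local RNG so the selection is reproducible

    # Group sample indices by their hard label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = min(ipc, len(indices))  # a class may have fewer than `ipc` samples
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy data: 100 samples spread evenly over 5 classes (assumed, not from the paper).
labels = [i % 5 for i in range(100)]
subset = random_subset_per_class(labels, ipc=10)
```

A model trained on `subset` with the original hard labels would correspond to the kind of random baseline the benchmark compares against, under the assumptions stated above.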
Abstract
Dataset distillation and dataset pruning are two prominent techniques for compressing datasets to improve computational and storage efficiency. Despite their overlapping objectives, these approaches are rarely compared directly. Even within each field, the evaluation protocols are inconsistent across various methods, which complicates fair comparisons and hinders reproducibility. Considering these limitations, we introduce in this paper a benchmark that equitably evaluates methodologies across both distillation and pruning literatures. Notably, our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, which heavily rely on soft labels from pre-trained models, even randomly selected subsets can achieve surprisingly competitive performance. This finding suggests that an overemphasis on soft labels may be diverting attention from the intrinsic…
Code & Models
- he-yang/2025-rethinkdc-imagenet-random-ipc-1 · dataset · 5 dl
- he-yang/2025-rethinkdc-imagenet-random-ipc-10 · dataset · 7 dl
- he-yang/2025-rethinkdc-imagenet-random-ipc-20 · dataset · 8 dl
- he-yang/2025-rethinkdc-imagenet-random-ipc-50 · dataset · 27 dl
- he-yang/2025-rethinkdc-imagenet-random-ipc-100 · dataset · 50 dl
Taxonomy
Topics: Advanced Data Compression Techniques
