Rethinking Dataset Distillation: Hard Truths about Soft Labels
Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, R. Venkatesh Babu

TL;DR
This paper critically examines dataset distillation methods, revealing that soft labels diminish the importance of data quality and introducing a new compute-aware pruning technique to improve distillation efficiency.
Contribution
It provides a detailed scalability analysis of dataset distillation under different label regimes and introduces CA2D, a compute-aligned distillation method that outperforms existing approaches.
Findings
High-quality coresets do not outperform random baselines in soft label regimes.
Performance saturates in soft label settings regardless of subset quality.
CA2D outperforms current methods on ImageNet-1K at various IPC settings.
Abstract
Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed the SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset,…
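To make the label regimes named in the abstract concrete, the following is a minimal sketch of how the training target differs between hard labels and soft labels. It is an illustration under stated assumptions, not the authors' code: the function names, the example logits, and the temperature value are all hypothetical, and the reading of SL+KD as a teacher producing fresh soft labels during training is an assumption based on the phrase "abundant soft labels".

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, student_logits, temperature=1.0):
    """Cross-entropy between a target distribution and the student's prediction."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, student_probs) if t > 0)

def hard_label_target(class_index, num_classes):
    """HL regime: a one-hot target that carries only the class identity."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]

def soft_label_target(teacher_logits, temperature=1.0):
    """SL regime: a teacher's full distribution, computed once and stored.
    Assumption: in SL+KD the same call would be repeated for each new
    augmented view during training rather than stored once."""
    return softmax(teacher_logits, temperature)

# Hypothetical logits for a single image over three classes.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [3.0, 2.5, -2.0]

hl_loss = cross_entropy(hard_label_target(0, 3), student_logits)
sl_loss = cross_entropy(soft_label_target(teacher_logits, 2.0),
                        student_logits, temperature=2.0)
print(f"hard-label loss: {hl_loss:.4f}")
print(f"soft-label loss: {sl_loss:.4f}")
```

The soft target carries the teacher's relative confidence across all classes, so each image supplies more supervision than its one-hot label does. That is one plausible reason the choice of images could matter less in the SL and SL+KD regimes.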
