Rethinking Dataset Distillation: Hard Truths about Soft Labels

Priyam Dey; Aditya Sahdev; Sunny Bhati; Konda Reddy Mopuri; R. Venkatesh Babu

arXiv:2604.18811 · cs.LG · April 22, 2026

TL;DR

This paper critically examines dataset distillation methods, revealing that soft labels diminish the importance of data quality and introducing a new compute-aware pruning technique to improve distillation efficiency.

Contribution

It provides a detailed scalability analysis of dataset distillation under different label regimes and introduces CA2D, a compute-aligned distillation method that outperforms existing approaches.

Findings

01. High-quality coresets do not outperform random baselines in soft-label regimes.

02. Performance saturates in soft-label settings regardless of subset quality.

03. CA2D outperforms current methods on ImageNet-1K across multiple images-per-class (IPC) settings.

Abstract

Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset,…
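
To make the three label regimes concrete, here is a minimal sketch of how the training target for a single image could differ between them. It is an illustration under stated assumptions, not the paper's implementation: the function names, the toy logits, and the reading of SL as "one stored teacher prediction per image" and SL+KD as "a fresh teacher prediction for every augmented view" are all assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, student_logits):
    """Cross-entropy between a target distribution and the student's softmax."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def hard_label_target(class_index, num_classes):
    """HL: a one-hot vector that carries only the class identity."""
    return [1.0 if k == class_index else 0.0 for k in range(num_classes)]

def fixed_soft_label_target(teacher_logits_clean, temperature=1.0):
    """SL (assumed reading): one teacher prediction per image, computed once and stored."""
    return softmax(teacher_logits_clean, temperature)

def kd_soft_label_target(teacher_logits_for_view, temperature=1.0):
    """SL+KD (assumed reading): the teacher is queried again for each augmented view,
    so the number of distinct soft labels grows with training length."""
    return softmax(teacher_logits_for_view, temperature)

if __name__ == "__main__":
    # Hypothetical logits for a 3-class toy problem.
    student_logits = [1.0, 0.5, -0.5]
    teacher_clean = [2.0, 1.0, 0.0]                      # teacher on the stored image
    teacher_views = [[2.2, 0.8, 0.1], [1.7, 1.3, -0.2]]  # teacher on two augmented views

    print("HL loss:", cross_entropy(hard_label_target(0, 3), student_logits))
    print("SL loss:", cross_entropy(fixed_soft_label_target(teacher_clean), student_logits))
    for i, view_logits in enumerate(teacher_views):
        print(f"SL+KD loss (view {i}):",
              cross_entropy(kd_soft_label_target(view_logits), student_logits))
```

In this sketch the hard label contributes the same one-hot target regardless of which image is chosen within a class, while both soft-label targets carry the teacher's full class distribution. That difference is one intuitive way to read the abstract's observation that subset quality matters less once soft labels are used.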
