A Label is Worth a Thousand Images in Dataset Distillation
Tian Qin, Zhiwei Deng, David Alvarez-Melis

TL;DR
This paper shows that the main driver of dataset distillation performance is the use of structured soft labels, not the technique used to generate the synthetic data, and draws out what this implies for data-efficient learning.
Contribution
The study demonstrates that soft labels with structured information are crucial for dataset distillation success, shifting focus from synthetic data generation to label quality.
Findings
Soft labels are the main factor in distillation performance.
Structured soft labels outperform unstructured ones.
Empirical scaling laws characterize how effective soft labels are as a function of the number of images per class in the distilled dataset.
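
To make the distinction in the findings above concrete, here is a minimal sketch contrasting a hard label, a structured soft label, and an unstructured soft label for a single image. Everything in it is an illustrative assumption rather than the paper's protocol: the three classes, the teacher logits, the temperature of 2.0, and the way the unstructured label is built (same top-class probability, remaining mass spread uniformly) are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into a probability vector (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) between two probability vectors."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical teacher logits for one image over (cat, dog, truck).
teacher_logits = [4.0, 2.5, -1.0]

# Hard label: only the class identity survives.
hard_label = [1.0, 0.0, 0.0]

# Structured soft label: keeps the teacher's ranking of the other classes
# ("dog" is a far more plausible confusion than "truck").
structured_soft = softmax(teacher_logits, temperature=2.0)

# Unstructured soft label (assumed construction): same confidence in the
# top class, but the remaining mass is spread uniformly, so the
# inter-class similarity information is discarded.
top = max(structured_soft)
rest = (1.0 - top) / (len(structured_soft) - 1)
unstructured_soft = [top if p == top else rest for p in structured_soft]

# Hypothetical student prediction for the same image.
student_probs = softmax([2.0, 1.0, 0.0])

for name, label in [
    ("hard", hard_label),
    ("structured soft", structured_soft),
    ("unstructured soft", unstructured_soft),
]:
    loss = cross_entropy(label, student_probs)
    print(f"{name:>17}: label={[round(p, 3) for p in label]} loss={loss:.3f}")
```

The three labels agree on the top class but carry different amounts of information about how the remaining classes relate to it, which is the property the findings single out.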
Abstract
Data is a crucial factor in the performance of machine learning models, a principle that dataset distillation methods exploit by compressing training datasets into much smaller counterparts that maintain similar downstream performance. Understanding how and why data distillation methods work is vital not only for improving these methods but also for revealing fundamental characteristics of "good" training data. However, a major challenge in achieving this goal is the observation that distillation approaches, which rely on sophisticated but mostly disparate methods to generate synthetic data, have little in common with each other. In this work, we highlight a largely overlooked aspect common to most of these methods: the use of soft (probabilistic) labels. Through a series of ablation experiments, we study the role of soft labels in depth. Our results reveal that the…
Taxonomy
Topics: Machine Learning and Data Classification
