A Label is Worth a Thousand Images in Dataset Distillation

Tian Qin; Zhiwei Deng; David Alvarez-Melis

arXiv:2406.10485 · cs.LG · January 22, 2025

TL;DR

This paper reveals that the key to effective dataset distillation is the use of structured soft labels rather than the synthetic data itself, challenging previous assumptions and providing new insights into data-efficient learning.

Contribution

The study demonstrates that soft labels with structured information are crucial for dataset distillation success, shifting focus from synthetic data generation to label quality.

Findings

01

Soft labels are the main factor in distillation performance.

02

Structured soft labels outperform unstructured ones.

03

Scaling laws relate soft label effectiveness to images-per-class.

Abstract

Data quality is a crucial factor in the performance of machine learning models, a principle that dataset distillation methods exploit by compressing training datasets into much smaller counterparts that maintain similar downstream performance. Understanding how and why data distillation methods work is vital not only for improving these methods but also for revealing fundamental characteristics of "good" training data. However, a major challenge in achieving this goal is the observation that distillation approaches, which rely on sophisticated but mostly disparate methods to generate synthetic data, have little in common with each other. In this work, we highlight a largely overlooked aspect common to most of these methods: the use of soft (probabilistic) labels. Through a series of ablation experiments, we study the role of soft labels in depth. Our results reveal that the…
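For readers unfamiliar with the term, the difference between a hard label and a soft (probabilistic) label can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the helper names, the three-class setup, the logit values, and the temperature are all assumptions made for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """Hard label: all probability mass on a single class."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Hypothetical logits from some pretrained model for classes (cat, dog, truck).
teacher_logits = [4.0, 2.5, -1.0]

hard_label = one_hot(0, 3)                              # only says "cat"
soft_label = softmax(teacher_logits, temperature=2.0)   # also encodes "dog-like, not truck-like"

# Hypothetical prediction of a model being trained on this example.
student_probs = softmax([1.0, 0.5, -0.5])

loss_hard = cross_entropy(hard_label, student_probs)
loss_soft = cross_entropy(soft_label, student_probs)
```

The soft label keeps the relative similarity between classes, which is the kind of structured information the summary above refers to; the hard label discards it.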

Code & Models

Repositories

sunnytqin/no-distillation
PyTorch · Official

Videos

A Label is Worth A Thousand Images in Dataset Distillation · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification