Vector-Quantized Soft Label Compression for Dataset Distillation
Ali Abbasi, Ashkan Shahbazi, Hamed Pirsiavash, Soheil Kolouri

TL;DR
This paper introduces a vector-quantized autoencoder to compress soft labels in dataset distillation, significantly reducing storage costs while maintaining high performance on vision and language benchmarks.
Contribution
The paper presents a vector-quantized autoencoder (VQAE) method for compressing soft labels in dataset distillation, reducing their storage overhead by 30-40x while retaining over 90% of baseline performance.
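To make the vector-quantization idea concrete, here is a minimal sketch of the quantization step only: each latent vector is replaced by the index of its nearest codebook entry, so only a short integer needs to be stored. The codebook size (256), latent dimension (8), and all function names are illustrative assumptions and are not taken from the paper. The encoder and decoder of the autoencoder are omitted.

```python
import math
import random


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for i, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    return [nearest_code(v, codebook) for v in latents]


def dequantize(indices, codebook):
    """Recover approximate latents by looking indices up in the codebook."""
    return [codebook[i] for i in indices]


# Hypothetical sizes, chosen only for illustration.
random.seed(0)
codebook = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(256)]

# Latents that sit very close to codebook entries 3, 200, and 17.
latents = [[c + 0.001 for c in codebook[i]] for i in (3, 200, 17)]

indices = quantize(latents, codebook)
reconstructed = dequantize(indices, codebook)

# Each stored index costs log2(codebook size) bits instead of one float per dimension.
bits_per_index = math.ceil(math.log2(len(codebook)))
```

Under these assumptions, each latent is stored as one 8-bit index rather than eight floating-point values. In a full VQAE the codebook would be learned jointly with the encoder and decoder, not sampled at random as it is here.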
Findings
Achieves 30-40x soft-label compression on ImageNet-1K.
Retains over 90% of baseline performance.
Validates effectiveness on vision and language benchmarks.
Abstract
Dataset distillation is an emerging technique for reducing the computational and storage costs of training machine learning models by synthesizing a small, informative subset of data that captures the essential characteristics of a much larger dataset. Recent methods pair synthetic samples and their augmentations with soft labels from a teacher model, enabling student models to generalize effectively despite the small size of the distilled dataset. While soft labels are critical for effective distillation, the storage and communication overhead they incur, especially when accounting for augmentations, is often overlooked. In practice, each distilled sample is associated with multiple soft labels, making them the dominant contributor to storage costs, particularly in large-class settings such as ImageNet-1K. In this paper, we present a rigorous analysis of bit requirements across dataset…
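The storage argument in the abstract can be illustrated with a rough back-of-the-envelope calculation. The sketch below compares the bits needed for dense soft labels against the bits needed for the distilled images themselves. Every number in it (10 images per class, 300 augmentations per image, 16-bit label values, 224x224 RGB images at 8 bits per channel) is an assumption chosen for illustration, not a figure from the paper.

```python
def soft_label_bits(num_samples, num_augmentations, num_classes, bits_per_value=16):
    """Bits to store one dense soft label per (sample, augmentation) pair."""
    return num_samples * num_augmentations * num_classes * bits_per_value


def image_bits(num_samples, height, width, channels=3, bits_per_value=8):
    """Bits to store the uncompressed distilled images."""
    return num_samples * height * width * channels * bits_per_value


# Hypothetical ImageNet-1K-style setting (all values are assumptions).
num_classes = 1000
num_samples = 10 * num_classes  # 10 distilled images per class
num_augmentations = 300         # soft labels stored per image

label_cost = soft_label_bits(num_samples, num_augmentations, num_classes)
image_cost = image_bits(num_samples, 224, 224)
ratio = label_cost / image_cost

print(f"soft labels: {label_cost / 8 / 1e9:.2f} GB")
print(f"images:      {image_cost / 8 / 1e9:.2f} GB")
print(f"labels / images: {ratio:.2f}x")
```

With these assumed values the soft labels take roughly four times as much space as the images, and the label cost grows linearly with both the number of classes and the number of stored augmentations.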
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
