Vector-Quantized Soft Label Compression for Dataset Distillation

Ali Abbasi; Ashkan Shahbazi; Hamed Pirsiavash; Soheil Kolouri

arXiv:2603.03808 · cs.CV · March 5, 2026

Open Access

TL;DR

This paper introduces a vector-quantized autoencoder (VQAE) that compresses the soft labels used in dataset distillation. It reduces soft-label storage by 30-40x on ImageNet-1K while retaining over 90% of baseline performance, and is validated on vision and language benchmarks.

Contribution

The paper presents a vector-quantized autoencoder (VQAE) for compressing soft labels in dataset distillation, reducing their storage overhead by 30-40x while retaining over 90% of baseline performance.
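To illustrate why quantization saves storage, the following is a minimal sketch of the vector-quantization lookup step alone: each vector is replaced by the index of its nearest codebook entry, and only the index is stored. It is not the paper's VQAE. The encoder and decoder networks and the codebook training are omitted, and the names `quantize`, `dequantize`, and `index_bits`, along with the codebook size and vector dimension, are illustrative assumptions.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each row of `vectors` to the index of its nearest codeword."""
    # Squared Euclidean distances, shape (num_vectors, num_codewords).
    diffs = vectors[:, None, :] - codebook[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    return dist2.argmin(axis=1)

def dequantize(indices, codebook):
    """Recover approximate vectors by looking up the stored indices."""
    return codebook[indices]

def index_bits(num_codewords):
    """Bits needed to store one index into a codebook of this size."""
    return (num_codewords - 1).bit_length()

# Hypothetical setup: 256 codewords of dimension 8 (not values from the paper).
rng = np.random.default_rng(0)
codebook = rng.random((256, 8))
latents = codebook[[3, 17, 200]] + 1e-6  # vectors very close to known codewords
indices = quantize(latents, codebook)
restored = dequantize(indices, codebook)
```

In this assumed configuration each 8-dimensional vector is stored as a single 8-bit index instead of eight 32-bit floats; the codebook itself is a one-time cost shared across all labels.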

Findings

01. Achieves 30-40x compression of soft labels on ImageNet-1K.

02. Retains over 90% of baseline performance.

03. Validates effectiveness on vision and language benchmarks.

Abstract

Dataset distillation is an emerging technique for reducing the computational and storage costs of training machine learning models by synthesizing a small, informative subset of data that captures the essential characteristics of a much larger dataset. Recent methods pair synthetic samples and their augmentations with soft labels from a teacher model, enabling student models to generalize effectively despite the small size of the distilled dataset. While soft labels are critical for effective distillation, the storage and communication overhead they incur, especially when accounting for augmentations, is often overlooked. In practice, each distilled sample is associated with multiple soft labels, making them the dominant contributor to storage costs, particularly in large-class settings such as ImageNet-1K. In this paper, we present a rigorous analysis of bit requirements across dataset…
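The storage argument in the abstract can be made concrete with a back-of-envelope count. The sketch below compares the bits needed for dense soft labels with the bits needed for the raw distilled images. All numeric settings (images per class, augmented views per image, image resolution, 32-bit values) are hypothetical choices for illustration and are not taken from the paper; the function names are likewise assumptions.

```python
def soft_label_bits(num_images, labels_per_image, num_classes, bits_per_value=32):
    """Bits needed to store every soft label as a dense vector."""
    return num_images * labels_per_image * num_classes * bits_per_value

def image_bits(num_images, height, width, channels=3, bits_per_channel=8):
    """Bits needed to store the distilled images as raw pixels."""
    return num_images * height * width * channels * bits_per_channel

# Hypothetical ImageNet-1K-style budget: 10 images per class, 300 augmented
# views per image, 224x224 RGB images. None of these values come from the paper.
num_images = 10 * 1000
label_cost = soft_label_bits(num_images, labels_per_image=300, num_classes=1000)
pixel_cost = image_bits(num_images, height=224, width=224)
ratio = label_cost / pixel_cost
```

Under these assumed settings the soft labels take about 12 GB against roughly 1.5 GB of raw pixels, which is the kind of imbalance the abstract describes for large-class settings.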

Taxonomy

Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques