Soft Label Pruning and Quantization for Large-Scale Dataset Distillation

Xiao Lingao; Yang He

arXiv:2604.18135·cs.CV·April 21, 2026

TL;DR

This paper introduces LPQLD, a method for large-scale dataset distillation that cuts soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy. It does so by increasing image diversity during synthesis and by pruning and quantizing the soft labels.

Contribution

The paper proposes LPQLD (Label Pruning and Quantization for Large-scale Distillation), which combines label pruning and quantization to shrink soft label storage and improve accuracy in large-scale dataset distillation.

Findings

01  Reduced soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K.

02  Achieved accuracy improvements of up to 7.2% and 2.8% on ImageNet-1K and ImageNet-21K, respectively.

03  Validated effectiveness across different architectures and distillation methods.

Abstract

Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve…
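
To make the storage argument concrete, the sketch below shows how pruning and quantizing a single soft label vector reduces its size. It is a generic illustration (top-k pruning plus uniform 8-bit quantization), not the paper's Label Pruning with Dynamic Knowledge Reuse, whose details are not given in the text above. All function names, the fp16 baseline, the 2-byte class indices, and the example probabilities are assumptions made for illustration.

```python
# Illustrative only: generic top-k pruning and uniform quantization of one
# soft label vector. This is NOT the LPQLD procedure from the paper.


def dense_bytes(num_classes, bytes_per_value=2):
    """Storage for a full soft label vector (assumed fp16: 2 bytes per value)."""
    return num_classes * bytes_per_value


def prune_top_k(probs, k):
    """Keep the k largest entries as (class_index, probability) pairs."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


def quantize(pairs, bits=8):
    """Map each probability in [0, 1] to an integer code with `bits` bits."""
    levels = (1 << bits) - 1
    return [(i, round(p * levels)) for i, p in pairs]


def dequantize(codes, bits=8):
    """Recover approximate probabilities from integer codes."""
    levels = (1 << bits) - 1
    return [(i, c / levels) for i, c in codes]


def sparse_bytes(k, index_bytes=2, bits=8):
    """Storage for k (index, code) pairs; assumes 2-byte class indices."""
    return k * (index_bytes + bits // 8)


if __name__ == "__main__":
    num_classes = 1000  # ImageNet-1K class count
    # Hypothetical soft label: most mass on three classes, rest spread thinly.
    probs = [0.0001] * num_classes
    probs[3], probs[17], probs[42] = 0.6, 0.2, 0.1003

    codes = quantize(prune_top_k(probs, k=3))
    print("kept entries:", dequantize(codes))
    print("dense bytes: ", dense_bytes(num_classes))
    print("sparse bytes:", sparse_bytes(k=3))
```

The per-vector ratio printed here depends entirely on the assumed k, bit width, and index size, so it should not be read as reproducing the 78x or 500x figures reported for LPQLD.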

Code & Models

Repositories

he-y/soft-label-pruning-quantization-for-dataset-distillation
github