Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation?

Lingao Xiao; Yang He

arXiv:2410.15919·cs.CV·November 5, 2024

TL;DR

This paper shows that class-wise supervision during dataset distillation increases within-class image diversity, which reduces the need for large-scale soft labels and allows them to be compressed substantially while improving performance.

Contribution

It introduces class-wise batching in dataset distillation, reducing soft label size and complexity while improving image diversity and performance.

Findings

01. 40x reduction in soft label size

02. 2.6% performance improvement

03. Effective soft label pruning method

Abstract

In ImageNet-condensation, the storage for auxiliary soft labels exceeds that of the condensed dataset by over 30 times. However, are large-scale soft labels necessary for large-scale dataset distillation? In this paper, we first discover that the high within-class similarity in condensed datasets necessitates the use of large-scale soft labels. This high within-class similarity can be attributed to the fact that previous methods use samples from different classes to construct a single batch for batch normalization (BN) matching. To reduce the within-class similarity, we introduce class-wise supervision during the image synthesizing process by batching the samples within classes, instead of across classes. As a result, we can increase within-class diversity and reduce the size of required soft labels. A key benefit of improved image diversity is that soft label compression can be…
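The batching change described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cross_class_batches`, `class_wise_batches`, and `classes_per_batch` are hypothetical, and the toy label list stands in for a real dataset. It only shows the difference between building a batch across classes and building it within a single class; the BN matching and image synthesis steps are omitted.

```python
from collections import defaultdict


def cross_class_batches(labels, batch_size):
    """Form batches in index order, so a single batch can mix classes."""
    indices = list(range(len(labels)))
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def class_wise_batches(labels, batch_size):
    """Form batches in which every sample shares one class label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    batches = []
    for label in sorted(by_class):
        members = by_class[label]
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])
    return batches


def classes_per_batch(labels, batches):
    """Count the distinct class labels present in each batch."""
    return [len({labels[i] for i in batch}) for batch in batches]


# Toy labels (assumed for illustration): 3 classes, 4 samples each, interleaved.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
mixed = cross_class_batches(labels, 4)
pure = class_wise_batches(labels, 4)

print(classes_per_batch(labels, mixed))  # every batch mixes several classes
print(classes_per_batch(labels, pure))   # every batch holds exactly one class
```

In this sketch, any per-batch statistic computed over a `pure` batch describes one class only, which is the sense in which supervision becomes class-wise.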

Code & Models

Repositories

he-y/soft-label-pruning-for-dataset-distillation (PyTorch, official)

Videos

Are Large-scale Soft Labels Necessary for Large-scale Dataset Distillation? · SlidesLive

Taxonomy

Topics: Artificial Immune Systems Applications · Data Stream Mining Techniques · Machine Learning and Data Classification

Methods: Batch Normalization