Taming Diffusion for Dataset Distillation with High Representativeness

Lin Zhao; Yushu Wu; Xinru Jiang; Jianyang Gu; Yanzhi Wang; Xiaolin Xu; Pu Zhao; Xue Lin

arXiv:2505.18399 · cs.CV · May 27, 2025

TL;DR

This paper introduces D^3HR, a diffusion-based framework for dataset distillation that enhances the representativeness of distilled datasets, leading to improved accuracy across various models.

Contribution

The paper proposes a novel diffusion-based method using DDIM inversion and an efficient sampling scheme to generate more representative distilled datasets.

Findings

01. D^3HR achieves higher accuracy than state-of-the-art methods.

02. The method maintains the structural consistency of the data.

03. The method improves distribution matching in dataset distillation.

Abstract

Recent deep learning models demand larger datasets, driving the need for dataset distillation to create compact, cost-efficient datasets while maintaining performance. Owing to their powerful image generation capability, diffusion models have been introduced to this field to generate distilled images. In this paper, we systematically investigate issues present in current diffusion-based dataset distillation methods, including inaccurate distribution matching, distribution deviation with random noise, and separate sampling. Building on this, we propose D^3HR, a novel diffusion-based framework to generate distilled datasets with high representativeness. Specifically, we adopt DDIM inversion to map the latents of the full dataset from a low-normality latent domain to a high-normality Gaussian domain, preserving information and ensuring structural consistency to generate representative latents…
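The DDIM inversion step mentioned in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: it assumes one-dimensional "latents" drawn from a wide Gaussian, a linearly spaced cumulative schedule `alphas_bar`, and a closed-form noise predictor `toy_eps` standing in for a trained diffusion network. All function and variable names are hypothetical. The sketch only shows the deterministic update that carries clean latents toward an approximately unit-variance Gaussian domain, and the matching deterministic sampler that maps them back.

```python
import numpy as np

def toy_eps(x, ab_t, data_std):
    # Hypothetical stand-in for a trained noise predictor: the optimal
    # E[noise | x_t] when the clean data is N(0, data_std^2).
    return np.sqrt(1.0 - ab_t) * x / (ab_t * data_std**2 + 1.0 - ab_t)

def ddim_step(x, ab_from, ab_to, eps):
    # Deterministic DDIM update (eta = 0) between two noise levels.
    x0_pred = (x - np.sqrt(1.0 - ab_from) * eps) / np.sqrt(ab_from)
    return np.sqrt(ab_to) * x0_pred + np.sqrt(1.0 - ab_to) * eps

def ddim_invert(x0, alphas_bar, eps_fn):
    # Walk from clean latents (index 0) toward the Gaussian domain.
    x = x0.copy()
    for t in range(len(alphas_bar) - 1):
        eps = eps_fn(x, alphas_bar[t])
        x = ddim_step(x, alphas_bar[t], alphas_bar[t + 1], eps)
    return x

def ddim_sample(z, alphas_bar, eps_fn):
    # Walk back from the Gaussian domain to the clean latent domain.
    x = z.copy()
    for t in range(len(alphas_bar) - 1, 0, -1):
        eps = eps_fn(x, alphas_bar[t])
        x = ddim_step(x, alphas_bar[t], alphas_bar[t - 1], eps)
    return x

# Assumed toy setup: 10,000 scalar latents with standard deviation 3.
rng = np.random.default_rng(0)
data_std = 3.0
latents = data_std * rng.standard_normal(10_000)

# Assumed schedule: alphas_bar[0] = 1 means "no noise"; the last entry is
# kept slightly above zero so the division in ddim_step stays defined.
alphas_bar = np.linspace(1.0, 1e-3, 1001)
eps_fn = lambda x, ab: toy_eps(x, ab, data_std)

z = ddim_invert(latents, alphas_bar, eps_fn)   # Gaussian-domain latents
recon = ddim_sample(z, alphas_bar, eps_fn)     # mapped back to data domain

print("std before inversion:", latents.std())
print("std after inversion: ", z.std())
print("max round-trip error:", np.max(np.abs(recon - latents)))
```

Because both directions are deterministic, each inverted latent corresponds to one original latent, which is the property the abstract refers to as preserving information. In a real pipeline the closed-form predictor would be replaced by a pretrained diffusion model, and the round trip would be approximate rather than exact.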

Code & Models

Repositories

lin-zhao-resolve/d3hr (PyTorch, official)

Videos

Taming Diffusion for Dataset Distillation with High Representativeness · SlidesLive

Taxonomy

Topics: Neural Networks and Applications · Data Stream Mining Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate · ALIGN