SAS: Semantic-aware Sampling for Generative Dataset Distillation
Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

TL;DR
This paper introduces SAS, a semantic-aware sampling method for dataset distillation that leverages CLIP to produce compact, semantically rich datasets, improving downstream model performance.
Contribution
The paper proposes a CLIP-based semantic scoring and sampling strategy that selects distilled samples for both semantic relevance and diversity.
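As a rough illustration of what scoring and sampling for relevance and diversity could look like, the sketch below works on precomputed embeddings. It is not the paper's algorithm, whose exact scoring rule is not given in this summary. The function names, the margin-based score, the greedy selection, and the `diversity_weight` parameter are all assumptions made for illustration. The embeddings are assumed to come from a CLIP image encoder and text encoder, with one text prompt per class and at least two classes.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def semantic_scores(image_emb, text_emb, labels):
    # Hypothetical score: similarity to the sample's own class prompt
    # minus the highest similarity to any other class prompt.
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))
    sims = img @ txt.T                      # shape (num_samples, num_classes)
    rows = np.arange(len(labels))
    own = sims[rows, labels]
    others = sims.copy()
    others[rows, labels] = -np.inf          # mask out the sample's own class
    return own - others.max(axis=1)

def select_per_class(image_emb, scores, labels, k, diversity_weight=0.5):
    # Hypothetical greedy selection: for each class, repeatedly pick the
    # sample with the best score after penalising similarity to samples
    # that were already picked for that class.
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    selected = []
    for c in np.unique(labels):
        pool = list(np.flatnonzero(labels == c))
        chosen = []
        while pool and len(chosen) < k:
            if chosen:
                redundancy = (img[pool] @ img[chosen].T).max(axis=1)
            else:
                redundancy = np.zeros(len(pool))
            gain = scores[pool] - diversity_weight * redundancy
            chosen.append(pool.pop(int(np.argmax(gain))))
        selected.extend(chosen)
    return np.array(selected)
```

Setting `diversity_weight` to 0 reduces the selection to a plain top-k by score within each class, while larger values favour samples that are spread out in embedding space.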
Findings
Consistent performance improvements across multiple datasets.
Effective filtering of semantically relevant samples.
Enhanced semantic class discrimination in distilled datasets.
Abstract
Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end,…
