Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis

Junhyeok Choi; Sangwoo Mo; Minwoo Chae

arXiv:2602.19756·cs.CV·March 2, 2026


TL;DR

This paper introduces a simple, learning-free multimodal dataset distillation method using CLIP and unCLIP, which synthesizes data efficiently and generalizes well across different model architectures, outperforming existing methods.

Contribution

The proposed framework eliminates the need for large-scale training and joint optimization, enabling scalable, architecture-agnostic multimodal dataset distillation.

Findings

01

Outperforms optimization-based distillation methods

02

Achieves state-of-the-art cross-architecture generalization

03

Efficiently synthesizes multimodal data without training

Abstract

Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The core concept is both timely and important. The paper correctly identifies that summarizing a link between domains is a different and more nuanced problem than summarizing a single domain. This is a valuable contribution to the field.
- The procedure described is logical and well-conceived. The idea of selecting only the overlapping pairs within matched clusters to form prototypes is a particularly clever mechanism for strengthening cross-modal alignment.
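The mechanism the reviewer highlights, keeping only the pairs that fall in both halves of a matched image/text cluster pair, can be illustrated with a minimal sketch. This is not the authors' implementation: the page does not say how clusters are formed or matched, so the sketch assumes cluster labels are already given, swaps in a simple greedy match on overlap counts, and uses hypothetical names (`match_clusters`, `build_prototypes`) and toy 2-D vectors in place of CLIP embeddings. Decoding prototypes into images with unCLIP is omitted.

```python
from collections import Counter


def match_clusters(img_labels, txt_labels):
    """Greedily pair image clusters with text clusters by shared-pair count.

    Assumption: greedy one-to-one matching stands in for whatever matching
    rule the paper actually uses.
    """
    overlap = Counter(zip(img_labels, txt_labels))
    matched, used_img, used_txt = {}, set(), set()
    for (ci, ct), _count in overlap.most_common():
        if ci not in used_img and ct not in used_txt:
            matched[ci] = ct
            used_img.add(ci)
            used_txt.add(ct)
    return matched


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_prototypes(img_emb, txt_emb, img_labels, txt_labels):
    """Average only the pairs lying in both clusters of each matched pair."""
    prototypes = {}
    for ci, ct in match_clusters(img_labels, txt_labels).items():
        keep = [
            i
            for i, (a, b) in enumerate(zip(img_labels, txt_labels))
            if a == ci and b == ct
        ]
        prototypes[(ci, ct)] = (
            mean_vector([img_emb[i] for i in keep]),  # image prototype
            mean_vector([txt_emb[i] for i in keep]),  # text prototype
            keep,                                     # indices that overlapped
        )
    return prototypes


# Toy data: five image-text pairs with hypothetical cluster labels.
img_emb = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0], [0.0, 2.0], [0.0, 4.0]]
txt_emb = [[0.0, 1.0], [0.0, 3.0], [5.0, 5.0], [2.0, 0.0], [4.0, 0.0]]
img_labels = [0, 0, 0, 1, 1]
txt_labels = [1, 1, 0, 0, 0]

prototypes = build_prototypes(img_emb, txt_emb, img_labels, txt_labels)
# Pair 2 sits in image cluster 0 but text cluster 0, which is matched to
# image cluster 1, so it falls outside every overlap and is discarded.
for key, (img_proto, txt_proto, members) in prototypes.items():
    print(key, img_proto, txt_proto, members)
```

Discarding pairs whose image and text land in unmatched clusters is what makes each prototype an average over samples that agree across modalities, which is presumably why the reviewer sees it as strengthening cross-modal alignment.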

Weaknesses

- What I found less convincing is what to do with these distilled pairs (say, 300 of them). As I understand it, they cannot be used to train a CLIP-like model from scratch (it would have been cool…).
- The suggested application, using these pairs to link a vision space with a text space through a fine-tuned linear layer, is weaker than it seems. The two spaces are often already very aligned (see, e.g., Huh et al., 2024, "The Platonic Representation Hypothesis"), so it's reasona

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper proposes a novel dataset distillation method that uses a pre-trained CLIP encoder to extract image embeddings. The extracted embeddings are then passed through an unCLIP decoder to generate a distilled dataset. The paper's methodological presentation and its contributions are well articulated in the text. Empirically, PDS achieves state-of-the-art performance compared with dataset subset selection and multimodal dataset distillation baselines, demonstrating the advantag

Weaknesses

- The paper's presented PDS framework is evaluated only with ViT-L/14 CLIP encoders.
- The code to replicate results is not yet released (even anonymously).
- Which unCLIP decoder is used in the PDS framework? Currently, the authors provide a citation to Ho & Salimans (2022) and the guidance scale and sampling step hyperparameters, without specifying the model architecture used.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The work makes a strong case for the need for more compact, more easily produced datasets for multimodal training. The provided experiments are extensive and include the relevant ablation studies.

Weaknesses

I wonder whether it is possible to prove in any way that the learning result from such a distilled/synthesized dataset is similar to the original learning result. The resulting performance is only a weak sign of similarity between the trained models. I also find it contradictory that the method is motivated by the cost of relying on a large-scale dataset, while the method itself requires a trained CLIP model, which means that this large scale dataset was already use

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications