Multimodal Dataset Distillation Made Simple by Prototype-Guided Data Synthesis
Junhyeok Choi, Sangwoo Mo, Minwoo Chae

TL;DR
This paper introduces a simple, learning-free multimodal dataset distillation method using CLIP and unCLIP, which synthesizes data efficiently and generalizes well across different model architectures, outperforming existing methods.
Contribution
The proposed framework eliminates the need for large-scale training and joint optimization, enabling scalable, architecture-agnostic multimodal dataset distillation.
Findings
Outperforms optimization-based distillation methods
Achieves state-of-the-art cross-architecture generalization
Efficiently synthesizes multimodal data without training
Abstract
Recent advances in multimodal learning have achieved remarkable success across diverse vision-language tasks. However, such progress heavily relies on large-scale image-text datasets, making training costly and inefficient. Prior efforts in dataset filtering and pruning attempt to mitigate this issue, but still require relatively large subsets to maintain performance and fail under very small subsets. Dataset distillation offers a promising alternative, yet existing multimodal dataset distillation methods require full-dataset training and joint optimization of image pixels and text features, making them architecture-dependent and limiting cross-architecture generalization. To overcome this, we propose a learning-free dataset distillation framework that eliminates the need for large-scale training and optimization while enhancing generalization across architectures. Our method uses CLIP…
Peer Reviews
Decision: ICLR 2026 Poster
- The core concept is both timely and important. The paper correctly identifies that summarizing a link between domains is a different and more nuanced problem than summarizing a single domain. This is a valuable contribution to the field.
- The procedure described is logical and well-conceived. The idea of selecting only the overlapping pairs within matched clusters to form prototypes is a particularly clever mechanism for strengthening cross-modal alignment.
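The overlap-based prototype selection mentioned in the review above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each modality has already been clustered separately (the cluster ids are passed in), it matches clusters greedily by overlap count, and it averages the surviving embeddings into prototypes. The function name `matched_cluster_prototypes` and all parameter names are hypothetical.

```python
import numpy as np

def matched_cluster_prototypes(img_emb, txt_emb, img_labels, txt_labels):
    """Build paired prototypes from image-text pairs that overlap in matched clusters.

    Row i of img_emb and txt_emb is one image-text pair; img_labels[i] and
    txt_labels[i] are its cluster ids from clustering each modality separately.
    Hypothetical helper: the matching and averaging rules are assumptions.
    """
    img_labels = np.asarray(img_labels)
    txt_labels = np.asarray(txt_labels)
    n_img = img_labels.max() + 1
    n_txt = txt_labels.max() + 1

    # overlap[a, b] = number of pairs with image in cluster a and text in cluster b.
    overlap = np.zeros((n_img, n_txt), dtype=int)
    np.add.at(overlap, (img_labels, txt_labels), 1)

    # Greedy one-to-one matching by largest overlap (an assumed simplification;
    # an optimal assignment solver could be substituted).
    remaining = overlap.copy()
    img_protos, txt_protos = [], []
    for _ in range(min(n_img, n_txt)):
        a, b = np.unravel_index(np.argmax(remaining), remaining.shape)
        if remaining[a, b] <= 0:
            break  # no unmatched cluster pair shares any sample
        # Keep only pairs that fall in BOTH matched clusters.
        keep = (img_labels == a) & (txt_labels == b)
        img_protos.append(img_emb[keep].mean(axis=0))
        txt_protos.append(txt_emb[keep].mean(axis=0))
        # Mark both clusters as used.
        remaining[a, :] = -1
        remaining[:, b] = -1
    return np.stack(img_protos), np.stack(txt_protos)
```

Pairs whose image and text land in unmatched clusters are discarded, which is one plausible reading of why this selection would tighten cross-modal alignment. How the resulting prototypes are turned into synthetic images and captions is not specified in the excerpt here, so the sketch stops at the embedding level.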
- One thing I found less convincing is what to do with these distilled, let’s say, 300 pairs. From what I understood, it is not possible to use them to train a CLIP-like model from scratch (it would have been cool…).
- The application suggested, using these pairs to link a vision space with a text space through a fine-tuned linear layer, is weaker than it seems. The two spaces are often already very aligned (see, e.g., Huh et al., 2024, "The Platonic Representation Hypothesis"), so it's reasona
The paper proposes a novel dataset distillation method based on a pre-trained CLIP encoder and unCLIP decoder to extract image embeddings. These extracted embeddings are then passed to an unCLIP decoder to generate a distilled dataset. The paper's methodological presentation and its contributions are well articulated in the text. Empirically, PDS achieves state-of-the-art performance compared with dataset subset selection and multimodal dataset distillation baselines, demonstrating the advantag
- The PDS framework is evaluated only with ViT-L/14 CLIP encoders.
- The code to replicate the results has not yet been released (even anonymously).
- Which unCLIP decoder is used in the PDS framework? Currently, the authors provide a citation to Ho & Salimans (2022) and the guidance scale and sampling step hyperparameters, without specifying the model architecture used.
The work makes a strong case for the need for more compact and more easily produced datasets for multimodal training. The experiments are extensive and include ablation studies.
I wonder whether it is possible to show in any way that the model learned from such a distilled/synthesized dataset is similar to the model learned from the original dataset. The resulting performance is only a weak indicator of similarity between the trained models. I also find it contradictory that the motivation for the proposed method rests on the cost of needing a large-scale dataset, while the method itself requires a trained CLIP model, which means that this large scale dataset was already use
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
