Multi-Modal Dataset Distillation in the Wild

Zhuohang Dang; Minnan Luo; Chengyou Jia; Hangwei Qian; Xiaojun Chang; Ivor W. Tsang

arXiv:2506.01586·cs.CV·June 3, 2025

TL;DR

This paper introduces MDW, a framework that distills noisy multi-modal datasets into compact, clean ones, making multi-modal model training more efficient and more robust; the authors report strong scalability and noise tolerance.

Contribution

MDW is the first framework to effectively distill noisy multi-modal datasets into clean, compact datasets using learnable correspondences and dual-track collaborative learning.

Findings

01. MDW surpasses prior methods by over 15% across various compression ratios.

02. MDW effectively handles noisy web-crawled multi-modal data.

03. Distilled datasets improve model training efficiency and robustness.

Abstract

Recent multi-modal models have shown remarkable versatility in real-world applications. However, their rapid development encounters two critical data challenges. First, the training process requires large-scale datasets, leading to substantial storage and computational costs. Second, these data are typically web-crawled with inevitable noise, i.e., partially mismatched pairs, severely degrading model performance. To these ends, we propose Multi-modal dataset Distillation in the Wild, i.e., MDW, the first framework to distill noisy multi-modal datasets into compact clean ones for effective and efficient model training. Specifically, MDW introduces learnable fine-grained correspondences during distillation and adaptively optimizes distilled data to emphasize correspondence-discriminative regions, thereby enhancing distilled data's information density and efficacy. Moreover, to capture…
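
To make the noise problem in the abstract concrete, the sketch below shows how partially mismatched image-text pairs inflate a standard symmetric contrastive (InfoNCE-style) loss. This is not MDW itself: the summary above does not state the paper's training objective, so the loss, the temperature of 0.07, the synthetic embeddings, and every function name here are assumptions chosen only for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(images, texts, temperature=0.07):
    """Symmetric InfoNCE loss over index-aligned (image, text) embeddings.

    Hypothetical helper for illustration; not taken from the paper.
    """
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    # Cosine similarity between every image and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


rng = random.Random(0)
n_pairs, dim = 16, 32
images = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_pairs)]

# Clean captions: each text embedding sits close to its paired image embedding.
clean_texts = [[x + 0.05 * rng.gauss(0, 1) for x in img] for img in images]

# Noisy captions: rotate the first half so every pair in it is mismatched.
half = n_pairs // 2
noisy_texts = clean_texts[1:half] + clean_texts[:1] + clean_texts[half:]

clean_loss = symmetric_contrastive_loss(images, clean_texts)
noisy_loss = symmetric_contrastive_loss(images, noisy_texts)
print(f"clean: {clean_loss:.4f}  noisy: {noisy_loss:.4f}")
```

In this toy setup half of the pairs are wrong, and the loss on the noisy set is far higher than on the clean set because the objective keeps pulling mismatched images and captions together. That is the failure mode the abstract attributes to web-crawled data.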

Taxonomy

Topics: Data Stream Mining Techniques · Water Quality Monitoring Technologies · Machine Learning and Data Classification