Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, and Joey Tianyi Zhou

TL;DR
This paper introduces RepBlend, a novel framework for multimodal dataset distillation that alleviates modality collapse by blending representations and balancing supervision, leading to improved cross-modal learning and efficiency.
Contribution
RepBlend is the first method to address modality collapse in multimodal dataset distillation (MDD), using representation blending and symmetric projection matching to enhance intra-modal diversity and cross-modal alignment.
Findings
Outperforms prior MDD methods on Flickr-30K and MS-COCO
Improves IR@10 by up to 9.4 and TR@10 by up to 6.3
Provides up to 6.7× speedup in distillation
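For readers unfamiliar with the retrieval metrics above: IR@K and TR@K conventionally denote Recall@K for image retrieval (caption query, image candidates) and text retrieval (image query, caption candidates). The sketch below shows one way such a score can be computed from a similarity matrix. It is an illustration only, not the paper's evaluation code: the function names and the toy matrix are made up, and it assumes exactly one matching caption per image, whereas Flickr-30K and MS-COCO pair each image with several captions.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose matching candidate ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption for this sketch: the match for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity, keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)


def transpose(matrix):
    """Swap rows and columns so captions become the queries."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical image-to-text similarities: rows are images, columns are captions.
sim_i2t = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

tr_at_1 = recall_at_k(sim_i2t, 1)             # text retrieval: image -> caption
ir_at_1 = recall_at_k(transpose(sim_i2t), 1)  # image retrieval: caption -> image
tr_at_2 = recall_at_k(sim_i2t, 2)
ir_at_2 = recall_at_k(transpose(sim_i2t), 2)
```

Scores of this kind are usually reported as percentages, so an improvement in IR@10 refers to a change in the share of caption queries whose correct image appears among the ten highest-ranked candidates.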
Abstract
Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and an enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods…
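The two symptoms named in the abstract, over-concentrated intra-modal representations and a distributional gap across modalities, can be made concrete with simple diagnostics. The sketch below is an assumption, not the paper's measurement protocol: it uses mean pairwise cosine similarity within a modality as a concentration score and the Euclidean distance between the centroids of unit-normalized embeddings as a gap score. All function names and the toy vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pairwise_cosine(embs):
    """Average cosine similarity over all distinct pairs in one modality.

    Values near 1 mean the embeddings point in almost the same direction,
    i.e. they are highly concentrated.
    """
    unit = [normalize(v) for v in embs]
    total, count = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            total += sum(a * b for a, b in zip(unit[i], unit[j]))
            count += 1
    return total / count


def centroid_gap(embs_a, embs_b):
    """Euclidean distance between the centroids of two normalized embedding sets."""
    def centroid(embs):
        unit = [normalize(v) for v in embs]
        return [sum(col) / len(unit) for col in zip(*unit)]
    return math.dist(centroid(embs_a), centroid(embs_b))


# Hypothetical 2-D embeddings, chosen only to show the contrast.
spread_images = [[1.0, 0.0], [0.0, 1.0]]       # diverse directions
collapsed_images = [[1.0, 0.1], [1.0, -0.1]]   # nearly identical directions
texts = [[0.0, 1.0], [0.1, 1.0]]

spread_score = mean_pairwise_cosine(spread_images)
collapsed_score = mean_pairwise_cosine(collapsed_images)
gap = centroid_gap(collapsed_images, texts)
```

Under these assumed measures, a higher concentration score together with a larger centroid gap would be consistent with the collapse pattern the abstract describes.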
Taxonomy
Topics: Semantic Web and Ontologies
