CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder
Yongmin Lee, Hye Won Chung

TL;DR
CovMatch introduces a scalable multimodal dataset distillation method that jointly optimizes image and text encoders to improve cross-modal alignment and retrieval performance with fewer synthetic pairs.
Contribution
The paper proposes CovMatch, a framework that aligns the cross-covariance of real and synthetic features and enables joint optimization of the image and text encoders, surpassing prior methods that freeze the text encoder.
Findings
Outperforms state-of-the-art methods on Flickr30K and COCO.
Achieves up to 6.8% absolute gains in retrieval accuracy.
Uses only 500 synthetic pairs for training.
Abstract
Multimodal dataset distillation aims to synthesize a small set of image-text pairs that enables efficient training of large-scale vision-language models. While dataset distillation has shown promise in unimodal tasks, extending it to multimodal contrastive learning presents key challenges: learning cross-modal alignment and managing the high computational cost of large encoders. Prior approaches address scalability by freezing the text encoder and updating only the image encoder and text projection layer. However, we find this severely limits semantic alignment and becomes a bottleneck for performance scaling. We propose CovMatch, a scalable dataset distillation framework that aligns the cross-covariance of real and synthetic features while regularizing feature distributions within each modality. Unlike prior approaches, CovMatch enables joint optimization of both encoders, leading to…
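To make "aligning the cross-covariance of real and synthetic features" concrete, the following is a minimal sketch and not the authors' implementation. It assumes paired image and text features are given as lists of rows, computes the cross-covariance matrix for the real and the synthetic set, and penalizes the squared Frobenius norm of their difference. The function names, the choice of Frobenius norm, and the mean-matching term standing in for the within-modality regularizer are all assumptions made for illustration; the abstract does not specify the exact loss.

```python
# Illustrative sketch only; names and loss form are assumptions, not the paper's code.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cross_covariance(img_feats, txt_feats):
    """Cross-covariance (d_img x d_txt) between paired image and text features."""
    n = len(img_feats)
    mu_i = mean_vector(img_feats)
    mu_t = mean_vector(txt_feats)
    cov = [[0.0] * len(mu_t) for _ in range(len(mu_i))]
    for x, y in zip(img_feats, txt_feats):
        for a in range(len(mu_i)):
            for b in range(len(mu_t)):
                cov[a][b] += (x[a] - mu_i[a]) * (y[b] - mu_t[b]) / n
    return cov


def covariance_matching_loss(real_img, real_txt, syn_img, syn_txt):
    """Squared Frobenius distance between real and synthetic cross-covariances."""
    c_real = cross_covariance(real_img, real_txt)
    c_syn = cross_covariance(syn_img, syn_txt)
    return sum(
        (r - s) ** 2
        for row_r, row_s in zip(c_real, c_syn)
        for r, s in zip(row_r, row_s)
    )


def mean_matching_penalty(real_feats, syn_feats):
    """Hypothetical within-modality regularizer: squared distance between feature means."""
    mu_r = mean_vector(real_feats)
    mu_s = mean_vector(syn_feats)
    return sum((r - s) ** 2 for r, s in zip(mu_r, mu_s))


# Tiny usage example with 2-dimensional toy features.
real_img = [[1.0, 0.0], [0.0, 1.0]]
real_txt = [[1.0, 0.0], [0.0, 1.0]]
syn_img = [[0.0, 0.0], [0.0, 0.0]]
syn_txt = [[0.0, 0.0], [0.0, 0.0]]

loss = covariance_matching_loss(real_img, real_txt, syn_img, syn_txt)
reg = mean_matching_penalty(real_img, syn_img) + mean_matching_penalty(real_txt, syn_txt)
```

In an actual distillation loop the features would come from the image and text encoders, and the synthetic pairs would be updated by gradient descent on a loss of this kind; plain Python lists are used here only to keep the sketch dependency-free.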
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
