MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training
Wanyun Xie, Francesco Tonin, Volkan Cevher

TL;DR
MaD-Mix is a computationally efficient framework that derives multi-modal data mixtures for vision-language model training automatically, replacing costly manual tuning, accelerating training, and handling domains with missing modalities.
Contribution
MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables, enabling scalable and automatic mixture design for VLMs.
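To make the idea of turning per-domain alignment scores into a data mixture concrete, here is a minimal sketch. It is not the paper's method: the closed-form Fenchel-dual scores and the coupling variables are not given in this summary, so the sketch swaps in a simple stand-in score (mean cosine similarity between domain embeddings over shared modalities) followed by a softmax. The function name `mixture_weights`, the `temperature` parameter, and the toy embeddings are all hypothetical.

```python
import numpy as np

def mixture_weights(domain_embs, temperature=1.0):
    """Map per-domain, per-modality mean embeddings to mixture weights.

    domain_embs: dict of domain name -> dict of modality name -> 1-D array.
    A domain with a missing modality simply omits that key.
    ASSUMPTION: cosine similarity is a stand-in score, not MaD-Mix's
    closed-form alignment score.
    """
    names = list(domain_embs)
    scores = []
    for name in names:
        sims = []
        for other in names:
            if other == name:
                continue
            # Compare two domains only on modalities that both provide.
            shared = domain_embs[name].keys() & domain_embs[other].keys()
            for m in shared:
                a = np.asarray(domain_embs[name][m], dtype=float)
                b = np.asarray(domain_embs[other][m], dtype=float)
                sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        # A domain sharing no modality with any other gets a neutral score.
        scores.append(np.mean(sims) if sims else 0.0)
    scaled = np.asarray(scores) / temperature
    w = np.exp(scaled - scaled.max())  # numerically stable softmax
    return dict(zip(names, w / w.sum()))

# Toy usage: two image-text domains and one language-only domain.
toy = {
    "captions": {"image": [1.0, 0.0], "text": [1.0, 0.0]},
    "vqa": {"image": [1.0, 0.1], "text": [0.9, 0.1]},
    "text_only": {"text": [0.0, 1.0]},
}
print(mixture_weights(toy))
```

The language-only domain still receives a weight because it is scored only on the text modality it shares with the others, which loosely mirrors the summary's point about integrating domains with missing modalities.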
Findings
Matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning.
Accelerates VLM training across diverse benchmarks for 0.5B and 7B models with minimal overhead.
Improves accuracy in complex tri-modal scenarios.
Abstract
Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
