MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training

Wanyun Xie; Francesco Tonin; Volkan Cevher

arXiv:2602.07790 · cs.LG · February 10, 2026

Open Access

TL;DR

MaD-Mix is a computationally efficient framework that derives multi-modal data mixtures for vision-language model (VLM) training automatically. It replaces costly manual tuning, accelerates training, and handles domains with missing modalities.

Contribution

MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual, enabling scalable, automatic mixture design for VLMs.

Findings

01

Matches human-tuned data mixtures with 22% fewer training steps in image-text instruction tuning.

02

Accelerates VLM training across diverse benchmarks for 0.5B and 7B models, with minimal overhead.

03

Improves accuracy in complex tri-modal video-image-text scenarios.

Abstract

Vision-Language Models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practices rely on costly manual tuning. We propose MaD-Mix, a principled and computationally efficient framework that derives multi-modal data mixtures for VLM training. MaD-Mix formulates data mixing as modality-aware domain alignment maximization and obtains closed-form multi-modal alignment scores from the Fenchel dual through inter-modal coupling variables. MaD-Mix systematically handles domains with missing modalities, allowing for the integration of language-only domains. Empirical evaluations across 0.5B and 7B models demonstrate that MaD-Mix accelerates VLM training across diverse benchmarks. MaD-Mix matches human-tuned data mixtures using 22% fewer training steps in image-text instruction tuning. In complex tri-modal video-image-text scenarios, where manual tuning becomes…
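To make the notion of a "data mixture" concrete, here is a minimal sketch of how per-domain scores could be turned into sampling weights for training. It does not reproduce the closed-form alignment scores from the paper: the score values, the domain names, and the helper functions `scores_to_mixture` and `sample_domains` are all hypothetical and chosen only for illustration, and simple sum-normalization is an assumption, not necessarily the mapping MaD-Mix uses.

```python
import random


def scores_to_mixture(scores):
    """Normalize non-negative per-domain scores into weights that sum to 1.

    Assumption: plain sum-normalization. The paper's actual mapping from
    alignment scores to mixture weights is not shown in this abstract.
    """
    if any(s < 0 for s in scores.values()):
        raise ValueError("scores must be non-negative")
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one score must be positive")
    return {domain: s / total for domain, s in scores.items()}


def sample_domains(mixture, n, seed=0):
    """Draw n domain names, each chosen with probability equal to its weight."""
    rng = random.Random(seed)
    domains = list(mixture)
    weights = [mixture[d] for d in domains]
    return rng.choices(domains, weights=weights, k=n)


# Hypothetical placeholder scores, NOT values from the paper.
# "text_only" stands in for a language-only domain (missing image modality).
example_scores = {"image_text_vqa": 3.0, "image_text_ocr": 1.0, "text_only": 1.0}

mixture = scores_to_mixture(example_scores)
schedule = sample_domains(mixture, n=8)
```

Each entry of `schedule` names the domain from which the next training batch would be drawn, so domains with higher scores are visited more often over the course of training.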

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications