MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining

Bingbing Wen; Sirajul Salekin; Feiyang Kang; Bill Howe; Lucy Lu Wang; Javier Movellan; Manjot Bilkhu

arXiv:2604.14198 · cs.LG · April 17, 2026

TL;DR

MixAtlas optimizes data mixtures for multimodal midtraining by decomposing the training corpus along two axes, image concepts and task supervision. The resulting mixtures improve both benchmark performance and training efficiency.

Contribution

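MixAtlas searches the space of multimodal data mixtures using small proxy models paired with a Gaussian-process surrogate and GP-UCB acquisition. The mixtures it finds perform better than those from regression-based baselines at the same proxy budget and transfer to larger models.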

Findings

01 Optimized mixtures improve benchmark performance by up to 17.6%.

02 Training reaches the baseline loss in as few as half the steps.

03 Mixtures discovered on small proxy models transfer effectively to larger models.

Abstract

Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We…
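
The search loop described in the abstract (a Gaussian-process surrogate with GP-UCB acquisition over mixture weights) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one mixture weight per (image concept, task) cell, giving a 10 x 5 grid on the probability simplex; the abstract does not say how the two axes are combined. The function `proxy_score` is a hypothetical stand-in for training and evaluating a proxy model. The RBF kernel, its length scale, the Dirichlet candidate sampling, and the value of `beta` are likewise illustrative choices.

```python
import numpy as np

N_CONCEPTS, N_TASKS = 10, 5      # axis sizes named in the abstract
DIM = N_CONCEPTS * N_TASKS       # assumption: one weight per (concept, task) cell


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)


def gp_posterior(x_train, y_train, x_query, noise=1e-3):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    var = 1.0 - np.sum(v**2, axis=0)  # the RBF prior variance is 1
    return mean, np.sqrt(np.maximum(var, 1e-12))


def proxy_score(mixture):
    """Hypothetical stand-in for training a proxy model and scoring it.

    Returns a higher value the closer the mixture is to an arbitrary target.
    """
    target = np.full(DIM, 1.0 / DIM)
    target[:5] *= 3.0
    target /= target.sum()
    return -np.sum((mixture - target) ** 2)


def gp_ucb_search(n_init=8, n_rounds=12, n_candidates=512, beta=2.0, seed=0):
    """Run a small GP-UCB loop over mixtures sampled from the simplex."""
    rng = np.random.default_rng(seed)
    x = rng.dirichlet(np.ones(DIM), size=n_init)   # initial random mixtures
    y = np.array([proxy_score(m) for m in x])
    for _ in range(n_rounds):
        cand = rng.dirichlet(np.ones(DIM), size=n_candidates)
        # Standardize scores so the unit-variance GP prior is reasonable.
        y_std = (y - y.mean()) / (y.std() + 1e-9)
        mean, std = gp_posterior(x, y_std, cand)
        ucb = mean + beta * std                    # upper confidence bound
        pick = cand[np.argmax(ucb)]
        x = np.vstack([x, pick])
        y = np.append(y, proxy_score(pick))
    best = x[np.argmax(y)]
    return best.reshape(N_CONCEPTS, N_TASKS), y


if __name__ == "__main__":
    best_mixture, scores = gp_ucb_search()
    print(best_mixture.round(3))
    print("best score:", scores.max())
```

In this sketch the uncertainty term `beta * std` steers the search toward mixtures the surrogate has not yet explored, while `mean` favors regions that already score well. A regression-based baseline would instead fit a predictor once and pick its maximizer.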
