MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
Bingbing Wen, Sirajul Salekin, Feiyang Kang, Bill Howe, Lucy Lu Wang, Javier Movellan, Manjot Bilkhu

TL;DR
MixAtlas is a data-mixture optimization method for multimodal midtraining. It decomposes the training corpus along two axes, image concepts and task supervision, and searches the resulting mixture space to improve benchmark performance and training efficiency.
Contribution
The paper proposes optimizing multimodal data mixtures with small proxy models, a Gaussian-process surrogate, and GP-UCB acquisition, so that the mixtures it finds perform better and transfer to larger models.
Findings
Optimized mixtures improve performance by up to 17.6% on benchmarks.
Training with optimized mixtures reaches the baseline loss in as few as half the steps.
Mixtures discovered on small proxies transfer effectively to larger models.
Abstract
Domain reweighting can improve sample efficiency and downstream generalization, but data-mixture optimization for multimodal midtraining remains largely unexplored. Current multimodal training recipes tune mixtures along a single dimension, typically data format or task type. We introduce MixAtlas, a method that produces benchmark-targeted data recipes that can be inspected, adapted, and transferred to new corpora. MixAtlas decomposes the training corpus along two axes: image concepts (10 visual-domain clusters discovered via CLIP embeddings) and task supervision (5 objective types including captioning, OCR, grounding, detection, and VQA). Using small proxy models (Qwen2-0.5B) paired with a Gaussian-process surrogate and GP-UCB acquisition, MixAtlas searches the resulting mixture space with the same proxy budget as regression-based baselines but finds better-performing mixtures. We…
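To make the search loop in the abstract concrete, here is a minimal sketch of GP-UCB over mixture weights. It is not the authors' implementation. It assumes the mixture is a weight vector over a flattened 10 x 5 grid of (image concept, task) cells, which the abstract does not confirm. The RBF kernel, the Dirichlet candidate sampling, the value of `beta`, and the `toy_proxy_score` function (a hypothetical stand-in for training and evaluating a proxy model) are likewise illustrative assumptions.

```python
import numpy as np

N_CONCEPTS, N_TASKS = 10, 5  # axis sizes quoted in the abstract


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between rows of a and rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_obs, y_obs, x_cand, noise=1e-3):
    """Posterior mean and std at x_cand for a GP fit on standardized scores."""
    offset = y_obs.mean()
    scale = y_obs.std() if y_obs.std() > 0 else 1.0
    y_std = (y_obs - offset) / scale
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oc = rbf_kernel(x_obs, x_cand)
    chol = np.linalg.cholesky(k_oo)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_std))
    mean = (k_oc.T @ alpha) * scale + offset
    v = np.linalg.solve(chol, k_oc)
    var = np.clip(1.0 - (v**2).sum(axis=0), 1e-12, None)  # prior variance is 1
    return mean, np.sqrt(var) * scale


def toy_proxy_score(weights):
    """Hypothetical stand-in for 'train a proxy on this mixture, then evaluate'."""
    target = np.linspace(1.0, 2.0, weights.size)
    target /= target.sum()
    return -float(np.sum((weights - target) ** 2))


def gp_ucb_search(n_cells=N_CONCEPTS * N_TASKS, n_init=10, n_rounds=20,
                  n_candidates=256, beta=2.0, seed=0, score_fn=toy_proxy_score):
    """Pick mixtures by maximizing mean + beta * std over random candidates."""
    rng = np.random.default_rng(seed)
    # Initial design: random points on the probability simplex.
    mixtures = rng.dirichlet(np.ones(n_cells), size=n_init)
    scores = np.array([score_fn(w) for w in mixtures])
    for _ in range(n_rounds):
        candidates = rng.dirichlet(np.ones(n_cells), size=n_candidates)
        mean, std = gp_posterior(mixtures, scores, candidates)
        pick = candidates[np.argmax(mean + beta * std)]  # UCB acquisition
        mixtures = np.vstack([mixtures, pick])
        scores = np.append(scores, score_fn(pick))  # one proxy run per round
    best = int(np.argmax(scores))
    return mixtures[best], float(scores[best]), scores.tolist()


best_w, best_y, history = gp_ucb_search()
# View the best mixture as a concept-by-task grid (rows: concepts, cols: tasks).
print(best_w.reshape(N_CONCEPTS, N_TASKS).round(3))
```

In this sketch each round costs one proxy evaluation, so the total proxy budget is `n_init + n_rounds`. A real run would replace `toy_proxy_score` with the benchmark score of a proxy model trained on the sampled mixture.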
