MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue

TL;DR
This paper introduces MAmmoTH-VL, a large-scale multimodal instruction-tuning dataset with rich rationales that significantly improves the reasoning abilities of multimodal large language models and achieves state-of-the-art results on several benchmarks.
Contribution
The authors present a scalable method for building a dataset of 12 million instruction-response pairs with detailed rationales, which improves reasoning in multimodal models beyond what prior datasets achieve.
Findings
Achieved state-of-the-art performance on MathVerse, MMMU-Pro, and MuirBench benchmarks.
Improved reasoning performance, with gains of up to 13.3%.
Dataset construction steps such as rewriting and self-filtering are crucial to these results.
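As a rough illustration of how a rewrite step followed by a self-filter step can be composed, here is a minimal Python sketch. It is not the authors' pipeline: the `Pair` record, `build_dataset`, `toy_rewrite`, and `toy_keep` are hypothetical names, and the two toy functions are stand-ins for what would be calls to open models in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Pair:
    """One instruction-response pair (field names are illustrative)."""
    instruction: str
    response: str


def build_dataset(
    seeds: Iterable[Pair],
    rewrite: Callable[[Pair], Pair],
    keep: Callable[[Pair], bool],
) -> List[Pair]:
    """Rewrite each seed pair, then keep only the rewrites that pass the filter."""
    rewritten = (rewrite(p) for p in seeds)
    return [p for p in rewritten if keep(p)]


# Hypothetical stand-in for a model call that expands a short answer
# into a response containing a rationale.
def toy_rewrite(pair: Pair) -> Pair:
    return Pair(
        pair.instruction,
        f"Step-by-step reasoning goes here. Answer: {pair.response}",
    )


# Hypothetical stand-in for a model-based quality check:
# reject rewrites whose final answer is empty.
def toy_keep(pair: Pair) -> bool:
    return not pair.response.rstrip().endswith("Answer:")


seeds = [Pair("What is 2 + 3?", "5"), Pair("Name the chart type.", "")]
dataset = build_dataset(seeds, toy_rewrite, toy_keep)
```

Passing the rewrite and filter steps in as callables keeps the composition independent of any particular model, so each step can be swapped or ablated separately.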
Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominately repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
