Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Wenhao Shi, Zhiqiang Hu, Yi Bin, Junhua Liu, Yang Yang, See-Kiong Ng, Lidong Bing, Roy Ka-Wei Lee

TL;DR
Math-LLaVA introduces a new multimodal dataset and a model fine-tuned on it, which together significantly enhance the mathematical reasoning abilities of multimodal large language models through diverse, high-quality visual question-answer pairs.
Contribution
The paper presents MathV360K, a large, diverse dataset of roughly 360K multimodal mathematical question-answer pairs, and Math-LLaVA, a LLaVA-1.5-based model fine-tuned on it that improves multimodal mathematical reasoning performance.
Findings
19-point accuracy improvement over the LLaVA-1.5 base model on the MathVista minitest split
Achieves performance comparable to GPT-4V on MathVista minitest
Outperforms existing models on Math-V, MathVerse, and MMMU benchmarks
Abstract
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of…
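As a quick check on the dataset composition described in the abstract, the following minimal sketch tallies the stated figures. The function name and dictionary keys are hypothetical, and treating the 40K collected images as 40K question-answer pairs is an assumption made only for this arithmetic.

```python
# Approximate sizes as stated in the abstract.
# Assumption: each of the 40K collected images contributes one QA pair.
COLLECTED_PAIRS = 40_000      # gathered from 24 existing datasets
SYNTHESIZED_PAIRS = 320_000   # newly synthesized pairs


def dataset_summary(collected: int, synthesized: int) -> dict:
    """Return the total size and the share contributed by synthesis."""
    total = collected + synthesized
    return {
        "total": total,
        "synthesized_fraction": synthesized / total,
        "synthesized_per_collected": synthesized / collected,
    }


summary = dataset_summary(COLLECTED_PAIRS, SYNTHESIZED_PAIRS)
print(summary)
```

Under these figures the total comes to 360K pairs, which matches the dataset's name, and synthesized pairs outnumber collected ones eight to one.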
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
