MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li

TL;DR
MathCoder-VL introduces a novel approach to multimodal mathematical reasoning by leveraging code as supervision, creating large datasets and models that significantly improve problem-solving capabilities over existing models.
Contribution
The paper presents a new cross-modal alignment method that uses code as supervision, together with ImgCode-8.6M, the largest image-code dataset to date, and MathCoder-VL, a fine-tuned multimodal math model that achieves state-of-the-art results.
Findings
Surpasses GPT-4o and Claude 3.5 Sonnet in geometry problem-solving
Creates the largest image-code dataset to date
Achieves a new SOTA across six evaluation metrics
Abstract
Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math…
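The abstract's core argument is that figure-generating code pins down details a caption leaves out. The sketch below illustrates that point under stated assumptions: it uses TikZ as the figure language, and the triangle, its coordinates, the caption, and the helpers `parse_points` and `side_lengths` are all hypothetical, not taken from ImgCode-8.6M or from FigCodifier output. Exact vertex positions, and therefore side lengths, can be read straight from the code, while the caption only names the vertices.

```python
import math
import re

# Hypothetical image-code pair: TikZ source for a simple geometry figure.
# Coordinates and labels are illustrative only.
TIKZ_SOURCE = r"""
\begin{tikzpicture}
  \coordinate[label=left:$A$] (A) at (0,0);
  \coordinate[label=right:$B$] (B) at (4,0);
  \coordinate[label=above:$C$] (C) at (0,3);
  \draw (A) -- (B) -- (C) -- cycle;
\end{tikzpicture}
"""

# A typical natural-language caption for the same figure (hypothetical).
CAPTION = "A right triangle with vertices labeled A, B and C."

# Matches lines like: \coordinate[label=left:$A$] (A) at (0,0);
COORD_RE = re.compile(
    r"\\coordinate\[label=\w+:\$(\w)\$\] \(\w+\) at \(([-\d.]+),([-\d.]+)\);"
)


def parse_points(tikz: str) -> dict[str, tuple[float, float]]:
    """Extract each labeled point and its exact coordinates from TikZ code."""
    return {label: (float(x), float(y)) for label, x, y in COORD_RE.findall(tikz)}


def side_lengths(points: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Compute the distance between every pair of labeled points."""
    labels = sorted(points)
    return {
        a + b: math.dist(points[a], points[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }


points = parse_points(TIKZ_SOURCE)
lengths = side_lengths(points)
print(points)   # {'A': (0.0, 0.0), 'B': (4.0, 0.0), 'C': (0.0, 3.0)}
print(lengths)  # {'AB': 4.0, 'AC': 3.0, 'BC': 5.0}
# The caption carries none of these measurements:
print(any(ch.isdigit() for ch in CAPTION))  # False
```

The same exact quantities that a solver needs (here, a 3-4-5 right triangle) are recoverable from the code but not from the caption, which is the gap that code-based supervision is meant to close.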
Taxonomy
Topics: Educational Tools and Methods · Intelligent Tutoring Systems and Adaptive Learning
