MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Ke Wang; Junting Pan; Linda Wei; Aojun Zhou; Weikang Shi; Zimu Lu; Han Xiao; Yunqiao Yang; Houxing Ren; Mingjie Zhan; Hongsheng Li

arXiv:2505.10557·cs.CV·May 16, 2025

Open Access · 1 Repo · 5 Models · 4 Datasets · 1 Video

TL;DR

MathCoder-VL uses code as supervision to align mathematical figures with text, producing large image-code and instruction datasets and models that outperform existing models on multimodal math problem solving.

Contribution

The paper presents a cross-modal alignment method that uses code as supervision, the largest image-code dataset to date (ImgCode-8.6M), and a fine-tuned multimodal math model that achieves state-of-the-art results.

Findings

01. Surpasses GPT-4o and Claude 3.5 Sonnet in geometry problem-solving

02. Creates the largest image-code dataset to date

03. Achieves new SOTA across six evaluation metrics

Abstract

Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math…
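The abstract's claim that code "encodes all information needed to generate corresponding figures" can be illustrated with a minimal sketch. The names `Triangle` and `to_tikz` are hypothetical, and TikZ is chosen here only as one plausible figure language; this is not the paper's pipeline. The point is that the emitted source states every coordinate and label exactly, whereas a natural-language caption such as "a right triangle" would not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triangle:
    """Hypothetical figure description: three labeled vertices."""
    a: tuple[float, float]
    b: tuple[float, float]
    c: tuple[float, float]


def to_tikz(tri: Triangle) -> str:
    """Emit TikZ source that draws the triangle and labels its vertices."""
    pts = {"A": tri.a, "B": tri.b, "C": tri.c}
    # Join the vertices into a closed path: (x,y) -- (x,y) -- (x,y) -- cycle
    path = " -- ".join(f"({x},{y})" for x, y in pts.values())
    lines = [r"\begin{tikzpicture}", rf"  \draw {path} -- cycle;"]
    for name, (x, y) in pts.items():
        # One label node per vertex, placed at the vertex coordinate
        lines.append(rf"  \node at ({x},{y}) {{{name}}};")
    lines.append(r"\end{tikzpicture}")
    return "\n".join(lines)


# The code side of an image-code pair for a 3-4-5 right triangle.
figure_code = to_tikz(Triangle((0, 0), (4, 0), (0, 3)))
print(figure_code)
```

Rendering `figure_code` with a LaTeX toolchain would give the image side of the pair; an image-to-code model would be trained to recover source like this from the rendered picture.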

Code & Models

Repositories

mathllm/mathcoder (Official)

Videos

MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning · underline

Taxonomy

Topics: Educational Tools and Methods · Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus