Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs

Shan Zhang; Aotian Chen; Yanpeng Sun; Jindong Gu; Yi-Yu Zheng; Piotr Koniusz; Kai Zou; Anton van den Hengel; Yuan Xue

arXiv:2501.06430·cs.CV·January 14, 2025

TL;DR

This paper identifies the critical role of fine-grained visual understanding in mathematical reasoning within multimodal large language models and introduces a novel vision encoder to improve geometric primitive recognition.

Contribution

The paper proposes SVE-Math, a new model with a geometric-grounded vision encoder and feature router, enhancing visual primitive recognition and reasoning in MLLMs.
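To make the idea of a feature router concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `route_features`, the linear gating scores, and the `gate_weights` parameters are all assumptions chosen for illustration. It shows only the general pattern of scoring several visual feature vectors, normalizing the scores with a softmax, and fusing the features as a weighted sum.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_features(
    features: Sequence[Sequence[float]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse K feature vectors of equal length D into one vector.

    `gate_weights` is a hypothetical stand-in for learned parameters:
    one length-D vector per feature source, used to compute a scalar
    score for that source via a dot product.
    """
    # One scalar score per feature source (assumed linear gating).
    scores = [
        sum(w * x for w, x in zip(gw, f))
        for gw, f in zip(gate_weights, features)
    ]
    # Normalize the scores into routing weights that sum to 1.
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the feature vectors, dimension by dimension.
    fused = [
        sum(weights[k] * features[k][d] for k in range(len(features)))
        for d in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    # Two toy feature vectors, e.g. a global image feature and a
    # geometry-oriented feature (both invented for this example).
    feats = [[1.0, 0.0], [0.0, 1.0]]
    gates = [[0.0, 0.0], [0.0, 0.0]]  # zero gates give uniform routing
    fused, weights = route_features(feats, gates)
    print(weights, fused)
```

With all-zero gate weights every source receives the same routing weight, so the fused vector is the plain average of the inputs. Non-zero gates shift the mix toward whichever source scores higher.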

Findings

01. SVE-Math outperforms other 7B models by 15% on MathVerse.

02. SVE-Math achieves competitive results on GeoQA despite smaller training datasets.

03. Advanced models like GPT-4o have a 70% error rate in geometric entity recognition.

Abstract

Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. The limitation is largely attributable to inadequate perception of geometric primitives during image-level contrastive pre-training (e.g., CLIP). While recent efforts to improve math MLLMs have focused on scaling up mathematical visual instruction datasets and employing stronger LLM backbones, they often overlook persistent errors in visual recognition. In this paper, we systematically evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance, underscoring the critical role of fine-grained visual understanding. Notably, advanced models like GPT-4o exhibit a 70% error rate when identifying geometric entities,…


Code & Models

Repositories

ai4math-shanzhang/sve-math
PyTorch · Official
