TL;DR
This paper shows that fine-grained visual understanding is critical to mathematical reasoning in multimodal large language models (MLLMs) and introduces a vision encoder that improves recognition of geometric primitives.
Contribution
The paper proposes SVE-Math, a model that adds a geometric-grounded vision encoder and a feature router to improve visual primitive recognition and reasoning in MLLMs.
Findings
SVE-Math outperforms other 7B models by 15% on MathVerse.
SVE-Math achieves competitive results on GeoQA despite being trained on smaller datasets.
Advanced models like GPT-4o exhibit a 70% error rate when identifying geometric entities.
Abstract
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. The limitation is largely attributable to inadequate perception of geometric primitives during image-level contrastive pre-training (e.g., CLIP). While recent efforts to improve math MLLMs have focused on scaling up mathematical visual instruction datasets and employing stronger LLM backbones, they often overlook persistent errors in visual recognition. In this paper, we systematically evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance, underscoring the critical role of fine-grained visual understanding. Notably, advanced models like GPT-4o exhibit a 70% error rate when identifying geometric entities,…
