M$^3$-ACE: Rectifying Visual Perception in Multimodal Math Reasoning via Multi-Agentic Context Engineering
Peijin Xie, Zhen Xu, Bingquan Liu, Baoxun Wang

TL;DR
This paper introduces M3-ACE, a multi-agent framework that improves multimodal math reasoning by collaboratively rectifying visual perception errors, leading to state-of-the-art performance on several benchmarks.
Contribution
The paper proposes a novel multi-agent context engineering approach that decouples perception and reasoning, enhancing visual evidence extraction in multimodal math reasoning models.
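To make the idea of decoupling perception from reasoning concrete, here is a minimal sketch. It is not the authors' implementation: the agent interface, the `build_shared_context` and `make_reasoning_prompt` helpers, and the agree/dispute split are all assumptions introduced for illustration. The point is only that several perception agents contribute visual evidence to a shared context, and the reasoning step consumes that context rather than a single model's first impression.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical interface: a perception agent maps an image id to evidence strings.
PerceptionAgent = Callable[[str], Set[str]]


def build_shared_context(image_id: str,
                         agents: Sequence[PerceptionAgent]) -> Dict[str, List[str]]:
    """Pool evidence from all agents and separate agreed from disputed items."""
    votes: Counter = Counter()
    for agent in agents:
        votes.update(agent(image_id))  # each agent votes once per evidence item
    n = len(agents)
    return {
        "agreed": sorted(e for e, c in votes.items() if c == n),
        "disputed": sorted(e for e, c in votes.items() if c < n),
    }


def make_reasoning_prompt(question: str, context: Dict[str, List[str]]) -> str:
    """Build the text a separate reasoning step would see (illustrative format)."""
    lines = ["Agreed evidence:"] + [f"- {e}" for e in context["agreed"]]
    lines += ["Disputed evidence (re-check before use):"]
    lines += [f"- {e}" for e in context["disputed"]]
    lines += [f"Question: {question}"]
    return "\n".join(lines)


# Stub agents standing in for multimodal models (hypothetical outputs).
def agent_a(_image_id: str) -> Set[str]:
    return {"AB = 5", "angle ABC = 90 deg"}


def agent_b(_image_id: str) -> Set[str]:
    return {"AB = 5", "angle ABC = 60 deg"}


context = build_shared_context("fig1.png", [agent_a, agent_b])
prompt = make_reasoning_prompt("Find AC.", context)
```

In this toy setup the disputed items are where a perception error is most likely, so they are surfaced explicitly instead of being silently resolved by answer voting.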
Findings
Achieves 89.1% accuracy on the MathVision benchmark.
Significantly improves performance on MathVista and MathVerse datasets.
Demonstrates the effectiveness of perception correction in multimodal reasoning.
Abstract
Multimodal large language models have recently shown promising progress in visual mathematical reasoning. However, their performance is often limited by a critical yet underexplored bottleneck: inaccurate visual perception. Through systematic analysis, we find that most failures originate from incorrect or incomplete visual evidence extraction rather than deficiencies in reasoning capability. Moreover, models tend to remain overly confident in their initial perceptions, making standard strategies such as prompt engineering, multi-round self-reflection, or posterior guidance insufficient to reliably correct errors. To address this limitation, we propose M3-ACE, a multi-agentic context engineering framework designed to rectify visual perception in multimodal math reasoning. Instead of directly aggregating final answers, our approach decouples perception and reasoning by dynamically…
Taxonomy
Topics: Multimodal Machine Learning Applications · Data Visualization and Analytics · Advanced Graph Neural Networks
