VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning
Jingkun Ma, Runzhe Zhan, Yang Li, Di Sun, Hou Pong Chan, Lidia S. Chao, Derek F. Wong

TL;DR
VisAidMath introduces a benchmark and evaluation framework to assess the true reasoning capabilities of multi-modal models in geometric problem-solving, revealing a gap between perceived accuracy and genuine visual reasoning.
Contribution
The paper presents a new benchmark and a three-layered evaluation framework that critically assess visual aid generation and reasoning, exposing limitations in current large multi-modal models.
Findings
High-accuracy models often fail to generate valid visual aids.
Current models struggle with logical reasoning from visual information.
There is a disconnect between visual perception and reasoning in state-of-the-art models.
Abstract
A hallmark of advanced artificial intelligence is the capacity to progress from passive visual perception to the strategic modification of visual information to facilitate complex reasoning. This advanced capability, however, remains critically underdeveloped in current Large Multi-modal Models (LMMs). The deficiency is often masked by evaluation metrics that prioritize final-answer accuracy, creating an illusion of competence where genuine reasoning is absent. Using the domain of geometric problem-solving as a precise instrument, we probe this issue through tasks that require constructing visual aids. To this end, we introduce VisAidMath, a challenging benchmark, and our novel Three-Layered Funnel Evaluation Framework. This framework moves beyond simple accuracy (ACCU) to scrutinize the generation of valid visual aids (PVA) and the soundness of subsequent reasoning steps…
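To make the funnel idea concrete, here is a minimal sketch of how layered pass rates could be computed. It is not the paper's implementation: the abstract is truncated and does not define the metrics precisely, so the field names (`answer_correct`, `aid_valid`, `reasoning_sound`), the function `funnel_rates`, and the assumption that each layer counts only samples that passed the previous one are all hypothetical. The first two fields loosely correspond to what the abstract calls ACCU and PVA.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # All three fields are hypothetical per-sample judgments, not the paper's schema.
    answer_correct: bool   # final answer matches the reference
    aid_valid: bool        # generated visual aid was judged valid
    reasoning_sound: bool  # reasoning steps after the aid were judged sound


def funnel_rates(samples):
    """Return the fraction of all samples surviving each successive filter.

    Assumption: the layers are nested, so a sample counts toward a layer
    only if it also passed every earlier layer.
    """
    total = len(samples)
    if total == 0:
        return {"answer_correct": 0.0, "aid_valid": 0.0, "reasoning_sound": 0.0}

    layer1 = [s for s in samples if s.answer_correct]
    layer2 = [s for s in layer1 if s.aid_valid]
    layer3 = [s for s in layer2 if s.reasoning_sound]

    return {
        "answer_correct": len(layer1) / total,
        "aid_valid": len(layer2) / total,
        "reasoning_sound": len(layer3) / total,
    }


if __name__ == "__main__":
    demo = [
        Sample(True, True, True),
        Sample(True, True, False),
        Sample(True, False, True),
        Sample(False, True, True),
    ]
    print(funnel_rates(demo))
```

Under these assumptions, a large drop between the first rate and the later ones would indicate answers that are correct without a valid visual aid or sound reasoning behind them, which is the kind of gap the abstract describes.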
Taxonomy
Topics: Data Visualization and Analytics
