GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Zijun Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, Junchi Yan

TL;DR
GeoBench introduces a hierarchical, multi-level benchmark for geometric reasoning in vision-language models, highlighting the importance of sub-goal decomposition and revealing performance challenges with complex tasks.
Contribution
The paper presents GeoBench, a novel hierarchical benchmark with verified tasks to systematically evaluate geometric reasoning in vision-language models, addressing existing evaluation limitations.
Findings
Reasoning models outperform MLLMs but struggle with complex tasks.
Sub-goal decomposition improves accuracy significantly.
Chain-of-Thought prompting can sometimes reduce performance.
Abstract
Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform…
Peer Reviews
Decision·ICLR 2026 Poster
1. The hierarchical framework is a useful diagnostic tool to provide actionable insights into pitfalls in the reasoning process. 2. The benchmark presented in this work goes beyond data collation and adds a unique set of information for analysing model performance.
1. This work does not detail how the automatically generated tasks are verified for accuracy and legitimacy. 2. Likewise, there is no insight into how the automatically generated problems are distributed in terms of logical and reasoning complexity. In addition to the empirical comparison against established benchmarks and their levels, the work could benefit from a deeper, qualitative analysis of the complexity and difficulty of the problems in this benchmark.
1. It identifies key limitations in geometric reasoning evaluation (e.g., data contamination, overemphasis on answers) and addresses them through GeoBench, a hierarchical benchmark that decomposes geometric reasoning into distinct stages. 2. The benchmark leverages the TrustGeoGen methodology to generate tasks verified for logical rigor, ensuring data novelty and mitigating contamination risks. This establishes a reliable foundation for equitable model evaluation.
The evaluation framework lacks a necessary human verification step. Given the complexity of the dataset problems (as shown in Figure 4), establishing a performance baseline from human experts is crucial. Furthermore, the scope of evaluation should be expanded to include advanced mathematical reasoning agents, particularly those capable of using tools for exploration or constructing auxiliary lines, in order to assess the true capabilities of current models under problem-solving paradigms that cl
I liked the clarity in the paper's writing and the results are comprehensive, and have two strengths to highlight: - Hierarchical evaluation grounded in cognitive theory: The benchmark’s structure, inspired by the van Hiele model, allows precise diagnosis of reasoning abilities rather than measuring final-answer accuracy alone. - Comprehensive and formally verified dataset: Using TrustGeoGen ensures rigorous, contamination-free problem generation, making GeoBench a strong diagnostic tool for e
The benchmark relies on synthetic, clean diagrams and controlled premises. This limits assessment of robustness to real-world variability such as hand-drawn figures, scanned textbook noise, ambiguous markings, and imperfect annotations. Adding a real-diagram slice or perturbation suite would strengthen ecological validity.
Taxonomy
Topics: Constraint Satisfaction and Optimization · Multimodal Machine Learning Applications · Spatial Cognition and Navigation
