GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs
Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, Yong Wu

TL;DR
GeoGramBench is a new benchmark designed to evaluate large language models' ability to perform geometric reasoning from programmatic descriptions, revealing significant challenges and deficiencies in current models.
Contribution
The paper introduces GeoGramBench, a benchmark of 500 problems organized by a three-level taxonomy of geometric complexity, and systematically evaluates 17 LLMs, highlighting their limitations.
Findings
Even the most advanced models achieve less than 50% accuracy at the highest abstraction level.
Current LLMs struggle with program-driven spatial reasoning.
GeoGramBench serves as a valuable resource for future research.
Abstract
Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the…
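To make the Program-to-Geometry task concrete, here is a minimal Python sketch. It is not taken from the benchmark: the Asymptote-style snippet, the helper names `extract_points` and `shoelace_area`, and the area question are all hypothetical. It shows the kind of step a model must perform implicitly, namely recovering a figure from drawing code and then reasoning about it.

```python
import re

# Hypothetical Asymptote-style drawing code; not an actual GeoGramBench item.
ASY_SNIPPET = "draw((0,0)--(4,0)--(0,3)--cycle);"


def extract_points(asy_code):
    """Return the (x, y) coordinate pairs found in the drawing code, in order."""
    pairs = re.findall(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", asy_code
    )
    return [(float(x), float(y)) for x, y in pairs]


def shoelace_area(points):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical question: "What is the area of the triangle drawn by the code?"
points = extract_points(ASY_SNIPPET)
area = shoelace_area(points)
print(points, area)
```

A script can read the coordinates directly, whereas the benchmark asks a model to build the same spatial picture from the code text alone, with no rendered image.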
Peer Reviews
Decision·ICLR 2026 Poster
1. Code-based spatial reasoning is an interesting and under-explored domain.
2. Clear task definition and a taxonomy grounded in geometric (not step-count) complexity; the benchmark isolates program-to-diagram understanding from general math difficulty.
3. Broad evaluation over 17 models with consistent prompting and multi-sample reporting; the “<50% on Abstract” result is a useful negative finding that signals a genuine capability gap.
4. The analysis of answer leakage challenges is sound and interesting.
1. Positioning vs. prior art (major). The paper does not cite SGP-Bench (ICLR 2025), which also evaluates LLMs’ understanding of symbolic graphics programs without rendering, at broader scope (SVG for 2D; CAD for 2D/3D) and with systematic invariance tests under program transformations. While GeoGramBench focuses more on the 2D geometry math aspect, I think SGP-Bench is closely related and predates this submission, so it should be discussed and contrasted.
2. GeoGramBench stays within Asymptote/Matpl
- Interesting insights are provided in Section 6, "Behavior analysis of LLMs".
- Detailed description of the data curation process (pre-selection, deduplication, human verification, and answer leakage prevention).
- A more detailed analysis of the main results table in Section 5.3 is required. It is interesting that models struggle with angle and volume, but the paper offers little insight into this (for example, why are these not the most challenging questions at the abstract level?).
- I am not entirely sure when the experiments were conducted, but maybe add a more recent model to the evaluation?
- The task seems undoubtedly interesting, and one could see the implication, but testing the model to make geometric reasoning based on pr
- The authors effectively address prevalent issues in similar datasets, such as answer leakage, through meticulous processing techniques that significantly elevate the benchmark's overall quality.
- The evaluation spans 17 open- and closed-source models of varying scales, uncovering both the capabilities and shortcomings of existing LLMs in this domain.
- The paper is generally well-written, with smooth and coherent flow throughout.
- It would be valuable to incorporate evaluations using vision models that directly interpret the rendered images, alongside results from multimodal models, to serve as a reference baseline. This would help disentangle whether the observed performance gaps arise from deficiencies in reasoning capabilities or challenges in parsing Asymptote code.
- There is a discrepancy between Table 1 and the corresponding text on page 8: "For example, GPT-o1 drops from 76.02% to 43.35%, and DeepSeek-R1 drops f
