Do Large Language Models Truly Understand Geometric Structures?
Xiaofeng Wang, Yiming Wang, Wenhong Zhu, Rui Wang

TL;DR
This paper introduces a new dataset and method to evaluate and improve large language models' understanding of geometric structures, addressing limitations of existing benchmarks and enhancing the models' spatial reasoning capabilities.
Contribution
The paper presents the GeomRel dataset for targeted evaluation and the GeoCoT method to improve LLMs' geometric relationship understanding.
Findings
GeomRel effectively isolates geometric reasoning skills.
GeoCoT significantly improves LLM performance on geometric tasks.
The evaluation identifies key limitations in current LLMs' geometric understanding.
Abstract
Geometric ability is a significant challenge for large language models (LLMs) due to the need for advanced spatial comprehension and abstract thinking. Existing datasets primarily evaluate LLMs on their final answers, but they cannot measure the models' true understanding of geometric structures, as LLMs can arrive at correct answers by coincidence. To fill this gap, we introduce the GeomRel dataset, designed to evaluate LLMs' understanding of geometric structures by isolating the core step of geometric relationship identification in problem-solving. Using this benchmark, we conduct thorough evaluations of diverse LLMs and identify key limitations in understanding geometric structures. We further propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships, resulting in significant performance improvements.
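To make the idea of scoring geometric relationship identification directly (rather than a final numeric answer) more concrete, here is a minimal sketch in Python. The item format, the example questions, and the function names `score_relations` and `accuracy_by_relation` are assumptions made for illustration; this summary does not describe the actual GeomRel schema or the authors' evaluation code.

```python
from collections import Counter

# Hypothetical multiple-choice items; the real GeomRel format may differ.
ITEMS = [
    {
        "question": "In triangle ABC, D is the midpoint of BC. "
                    "What is the relationship between line AD and line BC?",
        "options": {"A": "parallel", "B": "perpendicular",
                    "C": "intersecting", "D": "cannot be inferred"},
        "answer": "C",
    },
    {
        "question": "ABCD is a parallelogram. "
                    "What is the relationship between line AB and line CD?",
        "options": {"A": "parallel", "B": "perpendicular",
                    "C": "intersecting", "D": "cannot be inferred"},
        "answer": "A",
    },
    {
        "question": "Line l passes through point A and line m passes through "
                    "point B. What is the relationship between l and m?",
        "options": {"A": "parallel", "B": "perpendicular",
                    "C": "intersecting", "D": "cannot be inferred"},
        "answer": "D",
    },
]


def score_relations(items, predictions):
    """Return the fraction of items whose predicted option matches the label."""
    if not items or len(items) != len(predictions):
        raise ValueError("items and predictions must be non-empty and aligned")
    correct = sum(item["answer"] == pred
                  for item, pred in zip(items, predictions))
    return correct / len(items)


def accuracy_by_relation(items, predictions):
    """Break accuracy down by the gold relation (e.g. 'parallel')."""
    totals, hits = Counter(), Counter()
    for item, pred in zip(items, predictions):
        relation = item["options"][item["answer"]]
        totals[relation] += 1
        hits[relation] += int(pred == item["answer"])
    return {relation: hits[relation] / totals[relation] for relation in totals}


if __name__ == "__main__":
    # Predictions a model might return; the last one misses the indeterminate case.
    predictions = ["C", "A", "B"]
    print(score_relations(ITEMS, predictions))
    print(accuracy_by_relation(ITEMS, predictions))
```

A per-relation breakdown of this kind would show whether a model fails mainly on indeterminate cases, which an aggregate final-answer score would hide.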
Peer Reviews
Decision: ICLR 2025 Poster
- This paper introduces a new benchmark that reveals current LLMs struggle to effectively recognize geometric structures.
- It demonstrates that even with few-shot prompting or fine-tuning, current LLMs do not perform well in recognizing geometric relationships. To address this, the paper proposes a two-stage pipeline that guides LLMs to decompose and observe geometric structures.
- Experimental results in the final section show that the proposed two-stage pipeline effectively enhances LLM perfo

- This paper dedicates significant space to describing the rules and various data augmentation methods used in constructing the benchmark. However, the overall process and rationale for construction could be presented more clearly.
- In Section 3.5, the paper uses the LLaMA-3-8B-Instruct model as the base model for fine-tuning, which is somewhat unconventional, as it is more typical to fine-tune base models on math-related datasets.
- Including additional experimental metrics commonly used in ma

1. The GeomRel dataset, which isolates geometric relationship identification as a key step, provides a novel and focused way to evaluate LLMs' geometric understanding. This dataset addresses a unique gap in the field and enables more focused evaluation of geometric reasoning capabilities.
2. The Geometry Chain-of-Thought (GeoCoT) method improves LLMs’ performance in identifying geometric relationships by breaking down problems into reasoning steps. The proposed method significantly improved bot

1. Although fine-tuning on GeomRel was attempted, results are reported for only a single model. More experiments with different models could clarify which models improve with fine-tuning.
2. It is difficult to gauge how hard the dataset is for humans. Sampling the dataset and carrying out a systematic human evaluation would provide human baselines on these geometric reasoning problems.
3. GeomRel, while valuable f

* The paper establishes GRI as a new task, shifting focus beyond answer accuracy to intermediate steps in spatial reasoning.
* The proposed dataset, GeomRel, is valuable, especially with its advanced split that incorporates logical chains, indeterminate cases, and extraneous information, mimicking real-world problem complexity and ambiguity.
* The paper provides a thorough evaluation across multiple LLMs and reasoning methods, yielding insights into their spatial reasoning capacities and limit

* For disambiguation, the authors manually reviewed and excluded ambiguous data throughout the data construction process, which may reduce scalability and limit others’ ability to expand the benchmark.
* The paper’s focus on GRI as an isolated skill may be too narrow, leaving it uncertain if success in GRI tasks will translate to general spatial reasoning or even multi-step problem-solving abilities.
* The two-stage GeoCoT method, which involves decomposing geometry problems and reverse reasonin
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
