TL;DR
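This paper introduces GeoRepEval, a framework for evaluating how robust large language models are in geometric reasoning when the same problem is posed in Euclidean, coordinate, or vector form. It finds accuracy gaps of up to 14 percentage points caused by representation choice alone, and shows that convert-then-solve prompting recovers vector accuracy for high-capacity models.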
Contribution
The paper presents a representation-aware evaluation framework and accompanying metrics, including Invariance@3, together with empirical findings on how sensitive LLMs are to the representation of a geometric problem.
Findings
Representation choice alone produces accuracy gaps of up to 14 percentage points.
Vector formulations are a consistent failure point, with Invariance@3 as low as 0.044.
Convert-then-solve prompting improves vector accuracy significantly for high-capacity models.
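To make the Invariance@3 figure concrete, here is a minimal sketch under a stated assumption: that Invariance@3 is the fraction of problems answered correctly in all three representations. This summary does not give the formal definition, so treat that reading, the function names, and the toy data as hypothetical illustrations rather than the paper's implementation.

```python
from typing import Dict, List

# Representation names taken from the abstract; the dict layout is an assumption.
REPRESENTATIONS = ("euclidean", "coordinate", "vector")


def accuracy(correct: Dict[str, List[bool]], rep: str) -> float:
    """Fraction of problems answered correctly in one representation."""
    flags = correct[rep]
    return sum(flags) / len(flags)


def invariance_at_3(correct: Dict[str, List[bool]]) -> float:
    """Assumed definition: share of problems correct in ALL three representations."""
    per_problem = zip(*(correct[rep] for rep in REPRESENTATIONS))
    robust = [all(flags) for flags in per_problem]
    return sum(robust) / len(robust)


# Toy data: index i is the same underlying problem in each representation.
toy = {
    "euclidean":  [True, True, True, False, True],
    "coordinate": [True, True, False, False, True],
    "vector":     [True, False, False, False, True],
}

inv3 = invariance_at_3(toy)
for rep in REPRESENTATIONS:
    acc = accuracy(toy, rep)
    # Under this definition, accuracy splits into a robust part (inv3)
    # and a fragile remainder (acc - inv3), which is never negative.
    print(f"{rep}: accuracy={acc:.2f} robust={inv3:.2f} fragile={acc - inv3:.2f}")
```

Under this reading, a problem counted as robust is also counted as correct in every single representation, so Invariance@3 can never exceed the accuracy of the weakest representation.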
Abstract
Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest…
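The abstract names paired McNemar tests and bootstrap confidence intervals without giving details. The sketch below shows one common form of each, an exact two-sided McNemar test on discordant pairs and a percentile bootstrap over problems. Whether the paper uses these exact variants is an assumption, and all function names and parameters here are hypothetical.

```python
import math
import random
from typing import List, Tuple


def mcnemar_exact(a_correct: List[bool], b_correct: List[bool]) -> float:
    """Exact two-sided McNemar p-value for paired per-problem correctness."""
    # Discordant pairs: solved under one representation but not the other.
    b = sum(x and not y for x, y in zip(a_correct, b_correct))
    c = sum(y and not x for x, y in zip(a_correct, b_correct))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_ci(flags: List[bool], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(flags)
    # Resample problems with replacement and record each resampled accuracy.
    means = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy paired outcomes for the same ten problems in two representations.
euclidean = [True] * 10
vector = [False] * 8 + [True] * 2

print("McNemar p-value:", mcnemar_exact(euclidean, vector))
print("Vector accuracy 95% CI:", bootstrap_ci(vector))
```

The pairing matters because both representations describe the same underlying problems, so only the problems where the two outcomes disagree carry information about a representation effect.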
