Beyond Accuracy: A Geometric Stability Analysis of Large Language Models in Chess Evaluation
Xidan Song, Weiqi Wang, Ruifeng Cao, Qingya Hu

TL;DR
This paper introduces a Geometric Stability Framework to evaluate large language models' reasoning in chess, revealing that high accuracy often masks poor conceptual understanding under geometric transformations.
Contribution
It proposes a novel evaluation method for LLMs that tests geometric stability, exposing limitations of accuracy as a sole performance metric in reasoning tasks.
Findings
GPT-5.1 shows high accuracy but poor stability under transformations.
Claude Sonnet 4.5 and Kimi K2 Turbo maintain high consistency across transformations.
Geometric stability correlates with reasoning robustness beyond standard accuracy.
Abstract
The evaluation of Large Language Models (LLMs) in complex reasoning domains typically relies on performance alignment with ground-truth oracles. In the domain of chess, this standard manifests as accuracy benchmarks against strong engines like Stockfish. However, high scalar accuracy does not necessarily imply robust conceptual understanding. This paper argues that standard accuracy metrics fail to distinguish between genuine geometric reasoning and the superficial memorization of canonical board states. To address this gap, we propose a Geometric Stability Framework, a novel evaluation methodology that rigorously tests model consistency under invariant transformations, including board rotation, mirror symmetry, color inversion, and format conversion. We applied this framework to a comparative analysis of six state-of-the-art LLMs including GPT-5.1, Claude Sonnet 4.5, and Kimi K2 Turbo,…
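To make the transformations named in the abstract concrete, here is a minimal Python sketch that applies three of them (mirror symmetry, 180-degree rotation, color inversion) to the piece-placement field of a FEN string, then measures how often an evaluator stays consistent. Everything in it is an assumption for illustration: the function names, the `stability` score, the tolerance, and the toy material-count evaluator are hypothetical and not taken from the paper, whose exact scoring procedure is not given in this excerpt. The sketch handles only the piece-placement field, not side to move, castling rights, or en passant.

```python
from typing import Callable, List, Tuple

def expand(placement: str) -> List[str]:
    """Expand a FEN piece-placement field into 8 rows of 8 squares ('.' = empty)."""
    return ["".join("." * int(ch) if ch.isdigit() else ch for ch in rank)
            for rank in placement.split("/")]

def compress(rows: List[str]) -> str:
    """Inverse of expand: collapse runs of empty squares back into digits."""
    ranks = []
    for row in rows:
        rank, empties = "", 0
        for ch in row:
            if ch == ".":
                empties += 1
                continue
            if empties:
                rank += str(empties)
                empties = 0
            rank += ch
        if empties:
            rank += str(empties)
        ranks.append(rank)
    return "/".join(ranks)

def mirror(placement: str) -> str:
    """Left-right mirror: swap the a-file with the h-file, b with g, and so on."""
    return compress([row[::-1] for row in expand(placement)])

def rotate_180(placement: str) -> str:
    """Rotate the board by 180 degrees without changing piece colors."""
    return compress([row[::-1] for row in reversed(expand(placement))])

def invert_colors(placement: str) -> str:
    """Swap piece colors and flip the board vertically, so the sides trade places."""
    return compress([row.swapcase() for row in reversed(expand(placement))])

Evaluator = Callable[[str], float]
Check = Tuple[Callable[[str], str], float]  # (transform, expected sign of the score)

def stability(evaluate: Evaluator, placement: str,
              checks: List[Check], tol: float = 0.5) -> float:
    """Hypothetical score: fraction of transforms whose evaluation matches
    sign * (original evaluation) within tol."""
    base = evaluate(placement)
    agree = sum(abs(evaluate(t(placement)) - sign * base) <= tol
                for t, sign in checks)
    return agree / len(checks)

# Toy stand-in for a model: material balance from White's point of view.
VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material(placement: str) -> float:
    return float(sum(VALUES[ch.lower()] * (1 if ch.isupper() else -1)
                     for ch in placement if ch.isalpha()))

position = "4k3/8/8/8/8/8/4P3/4K3"
checks = [(mirror, 1.0), (invert_colors, -1.0)]
score = stability(material, position, checks)
```

In this sketch a mirrored position is expected to keep its evaluation, while a color-inverted one is expected to negate it. The material counter satisfies both by construction. An evaluator that relies on memorized canonical positions could drift on the transformed boards, which is the kind of gap the abstract says plain accuracy does not reveal.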
Taxonomy
Topics: Artificial Intelligence in Games · Topic Modeling · Artificial Intelligence in Healthcare and Education
