LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess

Sai Kolasani; Maxim Saplin; Nicholas Crispino; Kyle Montgomery; Jared Quincy Davis; Matei Zaharia; Chi Wang; Chenguang Wang

arXiv:2512.01992·cs.AI·December 2, 2025

TL;DR

LLM CHESS is a new benchmarking framework that evaluates large language models' reasoning and instruction-following abilities through extended chess interactions, revealing significant performance gaps and robustness challenges.

Contribution

This work introduces LLM CHESS, a novel dynamic benchmark for assessing reasoning and instruction-following in LLMs via chess, including a ranking system and public resources.

Findings

1. Many state-of-the-art models struggle to complete games or win consistently.

2. A clear separation exists between reasoning and non-reasoning models.

3. The benchmark's stochastic nature reduces overfitting and prevents saturation.

Abstract

We introduce LLM CHESS, an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in large language models (LLMs) through extended agentic interaction in the domain of chess. We rank over 50 open- and closed-source models by having them play against a random opponent, using a range of behavioral metrics, including win and loss rates, move quality, move legality, hallucinated actions, and game duration. For a subset of top reasoning models, we derive an Elo estimate by playing against a chess engine with variably configured skill, which allows for comparisons between models in an easily understandable way. Despite the simplicity of the instruction-following task and the weakness of the opponent, many state-of-the-art models struggle to complete games or achieve consistent wins. Similar to other benchmarks on complex reasoning tasks, our…
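The abstract does not say how the Elo estimate is computed from games against the engine. As an illustration only, the sketch below uses one common approach: find the rating at which the standard Elo expected score, summed over all games, equals the score actually achieved. The function names, the `(opponent_rating, score)` input format, and the search bounds are assumptions made for this example, not the authors' implementation.

```python
import math


def expected_score(rating, opponent_rating):
    """Standard Elo expected score of a player against a single opponent."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def estimate_elo(results, lo=-1000.0, hi=4000.0, tol=1e-6):
    """Estimate a rating from games against opponents of known rating.

    results: list of (opponent_rating, score) pairs, where score is
             1 for a win, 0.5 for a draw, and 0 for a loss.
    The search bounds lo/hi are arbitrary assumptions for this sketch.
    """
    actual = sum(score for _, score in results)
    if actual <= 0 or actual >= len(results):
        # All losses or all wins: the matching rating is unbounded.
        raise ValueError("need a mixed record to estimate a finite rating")

    # Total expected score increases with rating, so bisection converges.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        total_expected = sum(expected_score(mid, opp) for opp, _ in results)
        if total_expected < actual:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `estimate_elo([(1500, 1), (1500, 1), (1500, 1), (1500, 0)])` returns roughly 1691, since a 75% score corresponds to a rating gap of 400·log10(3) ≈ 191 points. A model that wins or loses every game has no finite estimate under this method, which is one reason opponents of varying strength are useful.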

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The agentic framework is a key innovation. It not only evaluates the quality of moves but also tests the model's integrated ability to follow instructions and use tools. The design is highly scalable: difficulty can be increased by simply raising the opponent's skill level, ensuring the benchmark's long-term relevance.
2. The breadth of the study, covering over 50 models, is commendable. This large scale provides a robust foundation for drawing conclusions about the current state of LLMs on thi

Weaknesses

1. A core part of the analysis distinguishes between "reasoning-enhanced" and "standard" models. The argument would be strengthened if the paper provided a more explicit and operational definition for this classification (e.g., based on specific test-time algorithms, architectural features, or training methods).
2. The analysis could be enriched by more granular case studies of errors. For instance, analyzing the types of mistakes (distinguishing between simple tactical blunders and deeper strat

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The paper is clear and easy to follow.
* Overall, it feels like solid work.
* I like how you set up Figure 4a (and b).

Weaknesses

* The primary contribution here is the resource. However, it is unclear what information or inferences using the resource will offer. What will future users of the benchmark learn from the results? (See questions)

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The choice of chess as the testbed is conceptually solid. It naturally embodies combinatorial search, long-horizon planning, and rule-based reasoning, making it a meaningful domain.
2. The analyses and experiments are extensive. The consistent advantage of reasoning-enhanced “thinking” models over standard LLMs provides credible support for the benchmark’s claims.
3. The framework is reproducible and extensible, with open code, public leaderboards, and adjustable opponent strengths, allowing

Weaknesses

1. Most LLMs obtain nearly zero Win/Loss in Table 4, suggesting that the current difficulty curve may be poorly calibrated. It remains unclear whether the benchmark measures reasoning limitations or simply overwhelms models with excessive interaction complexity.
2. Figure 1 conveys little information, with too much white space and minimal data illustration. Core analyses, such as the ablation experiments, should be moved into the main text instead of the appendix.
3. The agentic int

Taxonomy

Topics: Artificial Intelligence in Games · Topic Modeling · Multimodal Machine Learning Applications