ChessQA: Evaluating Large Language Models for Chess Understanding
Qianfeng Wen, Zhenwei Tang, and Ashton Anderson

TL;DR
ChessQA is a comprehensive benchmark designed to evaluate large language models' understanding of chess across multiple skill levels and task categories, providing a detailed and evolving assessment framework.
Contribution
The paper introduces ChessQA, a new benchmark that systematically measures LLMs' chess understanding across five categories, surpassing previous narrow evaluations.
Findings
Persistent weaknesses across all categories in current LLMs
Evaluation reveals varied performance by model size and training methods
Provides a platform for ongoing research and model improvement
Abstract
Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and…
Peer Reviews
Decision·Submitted to ICLR 2026
Strengths: 1. Clear task categorization mirroring human learning trajectories: Standardized notations (FEN/PGN/UCI), together with the five-category curriculum, mirror typical human learning trajectories from rules to explanation. This yields a richer, multi-dimensional assessment than single “best-move” tasks and helps identify where modern LLMs fall short. 2. Deterministic scoring: The benchmark leans on exact-match outputs, canonicalization (e.g., sorted sets for legal moves), and eng
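The deterministic scoring this reviewer describes (exact match after canonicalizing set-valued answers) can be sketched as follows. This is a minimal illustration, not the benchmark's actual scorer: the function names, the comma/whitespace separator handling, and the lowercasing step are all assumptions made for the example.

```python
def canonicalize_moves(answer: str) -> str:
    """Hypothetical canonical form for a set of moves in UCI notation.

    Assumed rules: split on commas or whitespace, lowercase each token,
    drop duplicates, and sort, so that ordering and separators do not
    affect the comparison.
    """
    tokens = answer.replace(",", " ").split()
    return " ".join(sorted({token.lower() for token in tokens}))


def exact_match(prediction: str, reference: str) -> bool:
    """Score a prediction as correct only if the canonical forms are identical."""
    return canonicalize_moves(prediction) == canonicalize_moves(reference)


# Same set of legal moves, different order and separators: counted as correct.
same_set = exact_match("e2e4, g1f3 d2d4", "d2d4 e2e4 g1f3")
# A missing move makes the answer wrong; there is no partial credit.
missing_move = exact_match("e2e4", "d2d4 e2e4")
```

Under these assumed rules, scoring never depends on a judge model or on free-text similarity, which is the property the reviewer highlights.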
Weaknesses: 1. No proper backing for fixed five-option mapping: Mapping engine centipawn (cp) evaluations to the fixed five-option grid {−400, −200, 0, 200, 400} is coarse, and it is unclear why these values serve better than any other setting, for example a seven-option grid, a three-option grid, or different values within the five-option grid itself. 2. No Depth-Based Tactical Reasoning Analysis: The benchmark does not assess whether models can think ahead, how many moves ahead they can calculate, or whether they can make tactical sacrifices. These
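To make the reviewer's concern about grid choice concrete, here is a small sketch of how a centipawn score could be assigned to an answer option. The nearest-value rule and the tie-break toward equality are assumptions for illustration; the review excerpt does not say how the paper maps engine scores onto the grid, and the three-option grid is a hypothetical alternative.

```python
FIVE_OPTION_GRID = (-400, -200, 0, 200, 400)  # grid quoted in the review
THREE_OPTION_GRID = (-300, 0, 300)            # hypothetical alternative grid


def nearest_option(cp: int, grid=FIVE_OPTION_GRID) -> int:
    """Return the grid value closest to an engine score in centipawns.

    Assumption: ties are broken toward the option with the smaller
    absolute value, i.e. toward an equal position.
    """
    return min(grid, key=lambda option: (abs(option - cp), abs(option)))


# The same +150 cp evaluation lands on different labels under different grids.
label_five = nearest_option(150)                       # 200
label_three = nearest_option(150, THREE_OPTION_GRID)   # 0 (tie broken toward equality)
```

Under these assumptions, a position that counts as a clear advantage on one grid counts as equal on another, which is why the reviewer asks for a justification of the chosen values.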
1. The construction of ChessQA is described clearly, with detailed specifications, prompt formatting, canonicalization rules, and task generation algorithms. The authors also provide reproducibility details and plan for public release. 2. The authors conducted extensive and comprehensive evaluations and experiments. 3. The authors provided detailed error analysis, which brings more insights beyond just accuracy.
1. The paper provides too few examples, which raises concerns about the format and quality of the data. 2. The literature review misses several recent papers that are directly pertinent to the goals and context of ChessQA, especially those benchmarking LLMs in chess and grid-based games and evaluating state tracking [1,2]. [1] Kuo, Mu-Tien et al. “Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoni
* Comprehensive task coverage in chess understanding. ChessQA covers 50 diverse tasks, offering a thorough and fine-grained evaluation of chess-related reasoning capabilities. * Extensive model evaluation. The benchmark is evaluated across 13 different models, and explicitly accounts for thinking modes, providing a broad perspective on model behavior. * In-depth analysis of results. The paper conducts detailed analyses, including token efficiency and performance scaling, offering insights into b
* Limited per-task coverage. The dataset includes 3,500 examples spanning 50 task types, resulting in an average of only 70 examples per task. This relatively small number may limit reliable evaluation for individual task categories. * No human baseline. The paper does not report human performance, either from laypeople or domain experts, which makes it difficult to contextualize the difficulty of the tasks and assess how far current models are from human-level understanding. * Lack of discussio
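The per-task coverage concern can be quantified with a rough calculation. The sketch below uses the normal approximation to a binomial proportion, which is a simplifying assumption and not something taken from the paper; only the 3,500 and 50 figures come from the review.

```python
import math


def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for an accuracy measured on n items.

    Uses the normal approximation sqrt(p * (1 - p) / n); this is only a
    rough guide, especially for accuracies near 0 or 1.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


total_examples = 3500  # from the review
task_types = 50        # from the review
per_task = total_examples // task_types  # 70 examples per task on average

# Worst case (accuracy = 0.5) for a single task versus the whole benchmark.
per_task_width = ci_half_width(0.5, per_task)
overall_width = ci_half_width(0.5, total_examples)
```

At 50% accuracy, 70 examples give an interval of roughly ±12 percentage points, compared with under ±2 points for the full 3,500, so differences between models on a single task type would need to be large to be distinguishable.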
