Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models
Ethan Tang

TL;DR
This paper critically evaluates chess-trained language models, arguing that their benchmark performance is largely explained by pattern-matching. It also shows that LLM-Modulo, a verifier-in-the-loop framework, substantially improves move accuracy and validity, offering a flexible alternative to domain-specific training.
Contribution
The paper introduces KinGPT, a 25M-parameter character-level model trained only on (position, best-move) pairs, and shows how verifier-in-the-loop methods improve performance, challenging claims that existing chess-trained models understand the game.
Findings
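KinGPT (25M parameters) outperforms the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark.
LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and its move-generation validity from 19.3% to 95.3%.
Code and models are open source for reproducibility.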
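As a rough illustration of the two metrics quoted in the findings, the sketch below computes best-move accuracy and move validity over a puzzle set. It is an assumption-laden sketch, not the paper's evaluation code: the function name `puzzle_metrics`, the string move encoding, and the definitions used (accuracy as exact match with the reference best move, validity as membership in the position's legal-move set) are all hypothetical.

```python
from typing import Sequence, Set, Tuple


def puzzle_metrics(
    predictions: Sequence[str],
    best_moves: Sequence[str],
    legal_moves: Sequence[Set[str]],
) -> Tuple[float, float]:
    """Return (best-move accuracy, move validity) as fractions in [0, 1].

    Assumed definitions (not taken from the paper):
      accuracy = share of puzzles where the prediction equals the best move
      validity = share of puzzles where the prediction is a legal move
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("need at least one puzzle")
    if not (n == len(best_moves) == len(legal_moves)):
        raise ValueError("all inputs must have the same length")

    correct = sum(p == b for p, b in zip(predictions, best_moves))
    valid = sum(p in legal for p, legal in zip(predictions, legal_moves))
    return correct / n, valid / n


# Toy data: three puzzles with made-up moves in a UCI-like notation.
accuracy, validity = puzzle_metrics(
    predictions=["e2e4", "d1h5", "zz"],
    best_moves=["e2e4", "g1f3", "a2a3"],
    legal_moves=[{"e2e4", "d2d4"}, {"d1h5", "g1f3"}, {"a2a3"}],
)
```

In this toy run one of three predictions matches the best move and two of three are legal, so validity can be far higher than accuracy, as in the reported figures.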
Abstract
Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model, only on (position, best-move) pairs; it exceeds the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in the existing literature regarding chess-trained language models and argue that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and move-generation validity from 19.3% to 95.3% on mate-in-N…
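The verifier-in-the-loop idea mentioned in the abstract can be sketched as a generate, check, and retry loop. This is a minimal sketch under stated assumptions, not the paper's LLM-Modulo implementation: the function names, the scripted stand-in for the language model, and the legality-only verifier backed by a hand-written legal-move set are all hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Verdict = Tuple[bool, str]  # (accepted?, critique text)


def verifier_in_the_loop(
    propose: Callable[[List[str]], str],
    verify: Callable[[str], Verdict],
    max_rounds: int = 5,
) -> Tuple[Optional[str], List[str]]:
    """Ask for candidates until the verifier accepts one or rounds run out.

    Critiques from rejected candidates are passed back to `propose`,
    which in a real system would fold them into the next model prompt.
    """
    critiques: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(critiques)
        accepted, critique = verify(candidate)
        if accepted:
            return candidate, critiques
        critiques.append(critique)
    return None, critiques


def make_legality_verifier(legal_moves: Set[str]) -> Callable[[str], Verdict]:
    """Hypothetical verifier: accepts a move only if it is in a known legal set."""
    def verify(move: str) -> Verdict:
        if move in legal_moves:
            return True, ""
        return False, f"'{move}' is not a legal move in this position"
    return verify


def make_scripted_proposer(script: Iterable[str]) -> Callable[[List[str]], str]:
    """Stand-in for a language model: replays a fixed list of candidate moves."""
    remaining = iter(script)

    def propose(critiques: List[str]) -> str:
        # A real proposer would condition on `critiques`; this one ignores them.
        return next(remaining)
    return propose


# Toy position with two legal moves (made up for illustration).
verify = make_legality_verifier({"d8h4", "g8f6"})
propose = make_scripted_proposer(["d8h9", "d8h4"])
move, critiques = verifier_in_the_loop(propose, verify)
```

Here the first candidate is rejected as illegal and the second is accepted, so `move` is the legal candidate and `critiques` holds one message. The loop never returns a move the verifier rejected, which is consistent with validity rising sharply, while accuracy still depends on what the model proposes.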
