Disentangling generalization and memorization in large language models using chess
Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

TL;DR
This paper uses chess as a testbed to analyze whether large language models rely on memorization or genuine reasoning, revealing limitations in their ability to generalize when relevant priors are scarce.
Contribution
It introduces a taxonomy of chess positions, graded by the density of relevant priors, that distinguishes memorization from reasoning in LLMs without requiring knowledge of the models' training data.
Findings
Performance degrades consistently as the density of relevant priors decreases.
Models regress to the random baseline on tasks with sparse priors.
Reasoning-augmented inference offers only limited gains when relevant priors are absent.
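To make the comparison against a random baseline concrete, here is a minimal sketch of how per-category accuracy could be set against chance. It is not the authors' evaluation code. It assumes each test position has a single engine-preferred move, that the random baseline is a uniformly random legal move (so chance is 1 / number of legal moves), and that positions are already labeled with a prior-density category. The category names, the function name `accuracy_vs_random`, and the records are hypothetical, synthetic values chosen only to exercise the function; they are not results from the paper.

```python
from collections import defaultdict


def accuracy_vs_random(records):
    """Summarize model accuracy and chance level per prior-density category.

    records: iterable of (category, n_legal_moves, model_hit) tuples, where
    model_hit is True if the model chose the engine-preferred move.
    Assumption: exactly one move counts as correct per position.
    """
    totals = defaultdict(lambda: {"n": 0, "hits": 0, "chance": 0.0})
    for category, n_legal_moves, model_hit in records:
        entry = totals[category]
        entry["n"] += 1
        entry["hits"] += int(model_hit)
        # Chance of a uniformly random legal move being the correct one.
        entry["chance"] += 1.0 / n_legal_moves
    return {
        category: {
            "accuracy": entry["hits"] / entry["n"],
            "random_baseline": entry["chance"] / entry["n"],
        }
        for category, entry in totals.items()
    }


# Synthetic records for illustration only (hypothetical categories).
toy_records = [
    ("dense_priors", 20, True),
    ("dense_priors", 20, True),
    ("dense_priors", 20, False),
    ("dense_priors", 20, True),
    ("sparse_priors", 40, False),
    ("sparse_priors", 40, False),
    ("sparse_priors", 40, False),
    ("sparse_priors", 40, False),
]

summary = accuracy_vs_random(toy_records)
for category, row in summary.items():
    print(category, row)
```

Under these assumptions, a category whose accuracy sits close to its `random_baseline` is one where the model does no better than guessing among legal moves.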
Abstract
Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors, ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably,…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Topic Modeling
