Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart

TL;DR
This paper introduces contrastive sparse autoencoders (CSAE), a novel interpretability framework for analyzing multi-step planning in chess AI systems, enabling extraction of meaningful concepts from game trajectories.
Contribution
The paper presents CSAE, a new method for interpreting complex planning processes in chess AI, addressing limitations of previous single-state interpretability techniques.
Findings
CSAE successfully extracts human-understandable planning concepts.
Qualitative analysis reveals meaningful features related to chess strategies.
Sanity checks validate the robustness of the interpretability results.
Abstract
AI has brought chess systems to a superhuman level, yet these systems rely heavily on black-box algorithms. This makes it hard to ensure transparency for the end-user, particularly when such systems are responsible for sensitive decision-making. Recent interpretability work has shown that the inner representations of Deep Neural Networks (DNNs) are fathomable and contain human-understandable concepts. Yet, these methods are seldom contextualised and are often based on a single hidden state, which makes them unable to interpret multi-step reasoning, e.g. planning. In this respect, we propose contrastive sparse autoencoders (CSAE), a novel framework for studying pairs of game trajectories. Using CSAE, we are able to extract and interpret concepts that are meaningful to the chess agent's plans. We primarily focused on a qualitative analysis of the CSAE features before proposing an automated…
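The abstract names the idea but this summary does not give the CSAE architecture or loss, so the following is only a generic sketch of what a sparse autoencoder trained on pairs of hidden activations might look like. Everything in it is an assumption made for illustration: the layer sizes, the names `init_sae`, `encode`, `decode` and `pair_loss`, the coefficients, and in particular the overlap penalty used as the contrastive term. It should not be read as the paper's method.

```python
import numpy as np

def init_sae(d_model, d_dict, seed=0):
    """Random weights for a one-layer sparse autoencoder (sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.1, size=(d_model, d_dict)),
        "b_enc": np.zeros(d_dict),
        "W_dec": rng.normal(0.0, 0.1, size=(d_dict, d_model)),
        "b_dec": np.zeros(d_model),
    }

def encode(params, h):
    """ReLU encoder: maps hidden states to non-negative feature activations."""
    return np.maximum(0.0, h @ params["W_enc"] + params["b_enc"])

def decode(params, f):
    """Linear decoder: reconstructs hidden states from feature activations."""
    return f @ params["W_dec"] + params["b_dec"]

def pair_loss(params, h_a, h_b, l1_coef=1e-3, contrast_coef=1e-2):
    """Loss over a batch of paired activations (one row per trajectory pair).

    The contrastive term here is an illustrative choice, NOT the paper's:
    it penalises features that fire on both members of a pair.
    """
    f_a, f_b = encode(params, h_a), encode(params, h_b)
    # Reconstruction error for both members of each pair.
    recon = np.mean((decode(params, f_a) - h_a) ** 2) \
          + np.mean((decode(params, f_b) - h_b) ** 2)
    # L1 penalty encourages sparse feature activations.
    sparsity = np.mean(np.abs(f_a)) + np.mean(np.abs(f_b))
    # Overlap penalty (assumed): large when the same features are active in both.
    overlap = np.mean(f_a * f_b)
    return recon + l1_coef * sparsity + contrast_coef * overlap

# Usage with random stand-ins for the agent's hidden states.
params = init_sae(d_model=16, d_dict=64)
rng = np.random.default_rng(1)
h_a = rng.normal(size=(8, 16))  # e.g. activations along one trajectory
h_b = rng.normal(size=(8, 16))  # e.g. activations along the paired trajectory
loss = pair_loss(params, h_a, h_b)
```

In a real setting, `h_a` and `h_b` would be hidden activations taken from the chess agent for the two trajectories in each pair, and the parameters would be optimised by gradient descent rather than left at their random initial values.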
Taxonomy
Topics: Sports Analytics and Performance · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
