Automatically Finding Rule-Based Neurons in OthelloGPT
Aditya Singh, Zihang Wen, Srujananjali Medicherla, Adam Karvonen, Can Rager

TL;DR
This paper introduces an automated decision tree-based method to interpret neurons in OthelloGPT, revealing rule-based patterns and verifying their causal role in move prediction.
Contribution
It presents a novel approach to automatically extract human-readable rule-based neuron interpretations in a transformer model for Othello, enabling better understanding of its internal logic.
Findings
Approximately half of the neurons in layer 5 are accurately described by compact, rule-based decision trees.
Targeted neuron ablations significantly impair move prediction along identified patterns.
The authors provide a Python tool for mapping game behaviors to neurons, intended to support future interpretability research.
Abstract
OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts decision paths where neurons are highly active to convert them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, neurons that specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ( for 913 of…
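The tree-fitting and rule-extraction steps described in the abstract can be illustrated with a small sketch. This is not the authors' code: it is a pure-Python stand-in for a regression tree over binary board features, and the feature names (`D3_empty`, `D4_theirs`, `D5_mine`, `E6_empty`), the synthetic "neuron", and the 0.5 activation threshold are all hypothetical choices made for illustration.

```python
from itertools import product

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_tree(X, y, depth=0, max_depth=3):
    """Greedy regression tree on binary features, returned as nested dicts."""
    node = {"mean": sum(y) / len(y)}
    if depth == max_depth:
        return node
    # A split is accepted only if it strictly reduces the squared error.
    best_sse, best_f = sse(y) - 1e-12, None
    for f in range(len(X[0])):
        left = [t for row, t in zip(X, y) if row[f] == 0]
        right = [t for row, t in zip(X, y) if row[f] == 1]
        if not left or not right:
            continue
        split_sse = sse(left) + sse(right)
        if split_sse < best_sse:
            best_sse, best_f = split_sse, f
    if best_f is None:
        return node  # no useful split: this node is a leaf
    node["feature"] = best_f
    for value in (0, 1):
        rows = [(row, t) for row, t in zip(X, y) if row[best_f] == value]
        node[value] = fit_tree([r for r, _ in rows], [t for _, t in rows],
                               depth + 1, max_depth)
    return node

def extract_rules(node, names, threshold, path=()):
    """Collect decision paths that end in a leaf with high mean activation."""
    if "feature" not in node:
        return [" AND ".join(path)] if node["mean"] >= threshold else []
    name = names[node["feature"]]
    return (extract_rules(node[0], names, threshold, path + ("NOT " + name,))
            + extract_rules(node[1], names, threshold, path + (name,)))

# Hypothetical binary board features (assumed names, not from the paper).
names = ["D3_empty", "D4_theirs", "D5_mine", "E6_empty"]

# Synthetic "neuron": active only when the first three features all hold,
# loosely mimicking a pattern in which a move at D3 becomes legal.
X = [list(bits) for bits in product((0, 1), repeat=len(names))]
y = [float(row[0] and row[1] and row[2]) for row in X]

tree = fit_tree(X, y, max_depth=3)
rules = extract_rules(tree, names, threshold=0.5)
print(rules)
```

On real data, the inputs would be board-state features from recorded games and the targets would be the recorded activations of a single MLP neuron; an established library implementation of regression trees would be the more practical choice than this hand-rolled version.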
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning
