Automatically Finding Rule-Based Neurons in OthelloGPT

Aditya Singh; Zihang Wen; Srujananjali Medicherla; Adam Karvonen; Can Rager

arXiv:2511.00059·cs.LG·November 4, 2025

TL;DR

This paper introduces an automated decision tree-based method to interpret neurons in OthelloGPT, revealing rule-based patterns and verifying their causal role in move prediction.

Contribution

It presents a novel approach to automatically extract human-readable rule-based neuron interpretations in a transformer model for Othello, enabling better understanding of its internal logic.

Findings

1. Approximately half of the neurons in layer 5 are accurately described by compact, rule-based decision trees ($R^{2} > 0.7$).

2. Ablating the identified neurons significantly impairs move prediction for the patterns those neurons encode.

3. The authors provide a Python tool for mapping game behaviors to neurons, intended to support future interpretability research.

Abstract

OthelloGPT, a transformer trained to predict valid moves in Othello, provides an ideal testbed for interpretability research. The model is complex enough to exhibit rich computational patterns, yet grounded in rule-based game logic that enables meaningful reverse-engineering. We present an automated approach based on decision trees to identify and interpret MLP neurons that encode rule-based game logic. Our method trains regression decision trees to map board states to neuron activations, then extracts decision paths where neurons are highly active and converts them into human-readable logical forms. These descriptions reveal highly interpretable patterns; for instance, some neurons specifically detect when diagonal moves become legal. Our findings suggest that roughly half of the neurons in layer 5 can be accurately described by compact, rule-based decision trees ($R^{2} > 0.7$ for 913 of…
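
The pipeline described in the abstract (fit a regression tree from board state to neuron activation, then read off the high-activation paths as rules) can be illustrated with a toy sketch. This is not the authors' implementation or their Python tool. Everything below is an assumption made for illustration: the four binary board features in `FEATURES`, the `synthetic_neuron` stand-in for a real MLP neuron, the 0.5 activation threshold, and the small hand-written tree learner, which is used here only to keep the example dependency-free.

```python
from itertools import product

# Hypothetical binary board features (illustrative names, not from the paper).
FEATURES = ["C3_empty", "D4_theirs", "E5_mine", "A1_empty"]

def synthetic_neuron(x):
    # Toy stand-in for a neuron: fires when a diagonal capture at C3 is possible.
    return 1.0 if x[0] and x[1] and x[2] else 0.0

def mean(v):
    return sum(v) / len(v)

def sse(v):
    # Sum of squared errors around the mean (the regression split criterion).
    m = mean(v)
    return sum((a - m) ** 2 for a in v)

def fit_tree(X, y, depth=0, max_depth=4):
    # Stop at the depth limit or when all targets in this node are identical.
    if depth == max_depth or sse(y) == 0.0:
        return {"value": mean(y)}
    best = None
    for j in range(len(X[0])):
        left = [i for i, x in enumerate(X) if x[j] == 0]
        right = [i for i, x in enumerate(X) if x[j] == 1]
        if not left or not right:
            continue  # this feature does not split the node
        cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, j, left, right)
    if best is None:
        return {"value": mean(y)}
    _, j, left, right = best
    return {
        "feature": j,
        "lo": fit_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
        "hi": fit_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth),
    }

def predict(tree, x):
    while "feature" in tree:
        tree = tree["hi"] if x[tree["feature"]] else tree["lo"]
    return tree["value"]

def active_rules(tree, threshold, path=()):
    # Collect root-to-leaf paths whose predicted activation exceeds the threshold.
    if "feature" not in tree:
        return [" AND ".join(path) or "TRUE"] if tree["value"] > threshold else []
    name = FEATURES[tree["feature"]]
    return (active_rules(tree["lo"], threshold, path + ("NOT " + name,))
            + active_rules(tree["hi"], threshold, path + (name,)))

def r_squared(y, y_hat):
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return 1.0 - ss_res / sse(y)

# Enumerate every combination of the toy features as the "dataset".
X = [list(bits) for bits in product([0, 1], repeat=len(FEATURES))]
y = [synthetic_neuron(x) for x in X]

tree = fit_tree(X, y)
rules = active_rules(tree, threshold=0.5)
r2 = r_squared(y, [predict(tree, x) for x in X])

print(rules)
print(r2)
```

On this toy data the tree ignores the irrelevant `A1_empty` feature and the single extracted rule is the conjunction of the three features that define the synthetic neuron. With real activations, the fit would be imperfect, and a score such as $R^{2}$ on held-out board states would indicate how well the tree describes the neuron.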

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning