Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Mohammad Taufeeque, Aaron David Tucker, Adam Gleave, Adrià Garriga-Alonso

arXiv:2506.10138·cs.LG·December 5, 2025

TL;DR

This paper reveals how a convolutional RNN trained with reinforcement learning to play Sokoban encodes plans as activations in specific channels and uses learned kernels to perform bidirectional planning and backtracking.
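
As a rough illustration of "plans as activations in specific channels", here is a minimal sketch of reading a plan back out. It assumes, hypothetically, four path channels (one per push direction), each stored as an H x W grid of floats. A box's planned push is taken to be the direction whose channel is most active at the box's cell, provided that activation clears an arbitrary threshold. The names `DIRECTIONS` and `planned_push` and the 0.5 threshold are illustrative only. They are not taken from the paper or from the learned-planner repository.

```python
# Toy readout of a plan from direction-specific path channels.
# Assumptions (hypothetical): four channels, one per push direction, each an
# H x W grid; the planned push is the most active channel at the box's cell.

DIRECTIONS = ("up", "down", "left", "right")


def planned_push(path_channels, box, threshold=0.5):
    """Return the direction a box at `box` = (row, col) is planned to be
    pushed, or None if no path channel is active enough at that cell."""
    r, c = box
    scores = {d: path_channels[d][r][c] for d in DIRECTIONS}
    best = max(DIRECTIONS, key=lambda d: scores[d])  # ties go to the first
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    H, W = 3, 4
    channels = {d: [[0.0] * W for _ in range(H)] for d in DIRECTIONS}
    # Encode a plan: the box at (1, 1) gets pushed right, then right again.
    channels["right"][1][1] = 0.9
    channels["right"][1][2] = 0.8
    print(planned_push(channels, (1, 1)))  # right
    print(planned_push(channels, (0, 0)))  # None
```

The threshold stands in for "a high activation". Without it, every cell would report some direction even when no plan passes through it.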

Contribution

It provides a mechanistic understanding of how a model-free reinforcement learning RNN encodes and constructs plans in Sokoban through specialized channels and kernels.

Findings

1. Path channels encode future move plans.
2. Kernels encode transition models for position changes.
3. Backtracking is implemented via negative values placed in path channels at obstacles.
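
Finding 2 can be pictured with a toy kernel. In a convolutional layer, a 3x3 kernel whose only nonzero weight sits off-center moves an activation by one cell, which is the "change in position" produced by one move. The sketch below builds one such kernel per direction and applies it with zero-padded cross-correlation, the operation conv layers compute. These kernels are hand-written stand-ins, not the trained DRC(3,3) weights. The names `OFFSETS`, `transition_kernel` and `correlate_same` are hypothetical.

```python
# Toy "transition kernel": a 3x3 kernel with a single 1 that shifts an
# activation one cell in a given direction. Illustrative only; these are not
# the learned weights of the paper's network.

OFFSETS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def transition_kernel(direction):
    """Build a 3x3 cross-correlation kernel whose output at cell p equals the
    input at p - offset, so an activation at q ends up at q + offset."""
    dr, dc = OFFSETS[direction]
    k = [[0.0] * 3 for _ in range(3)]
    k[1 - dr][1 - dc] = 1.0
    return k


def correlate_same(grid, kernel):
    """Zero-padded 3x3 cross-correlation with output the same size as input."""
    H, W = len(grid), len(grid[0])
    out = [[0.0] * W for _ in range(H)]
    for r in range(H):
        for c in range(W):
            s = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < H and 0 <= cc < W:
                        s += kernel[i][j] * grid[rr][cc]
            out[r][c] = s
    return out


if __name__ == "__main__":
    grid = [[0.0] * 4 for _ in range(3)]
    grid[1][1] = 1.0
    moved = correlate_same(grid, transition_kernel("right"))
    print(moved[1][2])  # 1.0
```

Activations shifted past the border are dropped by the zero padding. This matches a box that cannot leave the grid.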

Abstract

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation in a particular location means that, when a box is in that location, it will get pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals. These kernels extend activations in path channels forwards from boxes and backwards from the goal. Negative values are placed in channels at obstacles. This causes the extension kernels to…
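
The bidirectional extension described above can be mimicked in a toy setting. This sketch uses one row of a single hypothetical "push right" path channel and makes several simplifying assumptions:

- The corridor is 1-D, with the box to the left of the goal.
- Each step extends a front by exactly one cell.
- The forward and backward fronts are kept in separate lists so their meeting point is easy to see.
- Nothing stops a front from overshooting, so the paper's stopping and obstacle handling is not modeled.

`step_forward`, `step_backward` and `steps_to_connect` are hypothetical names.

```python
# Toy bidirectional plan extension along one row of a "push right" channel.
# Assumptions: 1-D corridor, box left of goal, one cell per step, fronts kept
# in separate lists, no stopping or obstacle logic.


def step_forward(front):
    """Extend forward one cell to the right (the box moves right after a push)."""
    return [max(front[c], front[c - 1] if c > 0 else 0.0) for c in range(len(front))]


def step_backward(front):
    """Extend backward one cell to the left (where the box must come from)."""
    n = len(front)
    return [max(front[c], front[c + 1] if c + 1 < n else 0.0) for c in range(n)]


def steps_to_connect(width, box, goal, bidirectional=True):
    """Count extension steps until columns box .. goal - 1 (every cell the box
    is pushed right from) are covered by the forward or backward front."""
    fwd = [0.0] * width
    bwd = [0.0] * width
    fwd[box] = 1.0  # forward front starts at the box
    if bidirectional:
        bwd[goal - 1] = 1.0  # the last push moves the box from goal - 1 onto goal
    steps = 0
    while not all(max(fwd[c], bwd[c]) > 0 for c in range(box, goal)):
        fwd = step_forward(fwd)
        if bidirectional:
            bwd = step_backward(bwd)
        steps += 1
    return steps


if __name__ == "__main__":
    print(steps_to_connect(10, 1, 9, bidirectional=False))  # 7
    print(steps_to_connect(10, 1, 9))  # 3
```

In this toy, growing the plan from both ends connects a box that is 8 pushes from its goal in 3 steps instead of 7. That count is a property of the sketch, not a measurement from the paper.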

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Plans live in identifiable channels; no probes needed to find them.
2. Concrete kernels for initialize/extend/stop/compete; elegant WTA tying to action selection.
3. Targeted edits flip actions at ~99% via GNA/PNA; large drops from path-channel ablations.
4. Long/short-term split with j-gate-mediated transfer explains overlapping plans.
5. The “weight steering” generalization and a clean recipe others can reuse.

Weaknesses

1. Risk of confirmation bias; limited inter-rater reliability reporting.
2. Define solve-rate protocol, seeds, and statistical tests more precisely; report CIs on all numbers.
3. One architecture (DRC(3,3)), one domain (Sokoban).
4. No check on other DRC sizes, Atari-DRC, or different training runs.
5. Interesting, but current evidence is circumstantial; could be toned down or supported with extra tests (e.g., explicit internal objective readouts).
6. Success metric is “any alternate action

Reviewer 02 · Rating 6 · Confidence 3

Strengths

I really do not have much to say. It may have some implications for RL-trained language models, especially those that reuse the same layer several times (e.g., diffusion language models, Universal Transformer, Sparse Universal Transformer) and use think tags.

Weaknesses

The analysis is performed on a particular set of weights of a particular RNN architecture using a particular dataset, random seed, etc., leaving it unclear whether the findings generalize to wider applications, or even to a different set of weights from a different random seed. (The authors acknowledge this in the appendix.) The paper does not have a technical contribution. Rather, the authors simply manually found the existence of path channels, which renders the use of linear probes / logisti

Reviewer 03 · Rating 4 · Confidence 3

Strengths

This paper offers a detailed mechanistic analysis of the DRC(3,3) reinforcement learning agent trained on Sokoban. Its originality lies in the depth of interpretability achieved—the authors go beyond linear probes to reveal path channels and plan extension kernels that implement a neural form of bidirectional planning. The study combines qualitative visualization, quantitative ablation, and causal intervention experiments in a careful way. The causal and ablation results convincingly show that

Weaknesses

Despite its depth, the paper can be very difficult to read. Readers unfamiliar with DRC(3,3) may struggle to follow the layer/tick conventions and channel indexing. A schematic overview early in the paper would improve clarity. Another issue is limited generality: all findings are restricted to Sokoban. It remains unclear whether the same mechanisms emerge in other planning-heavy environments or different architectures. Some comparative evidence (e.g., other environments) would help validate t

Code & Models

Repositories

alignmentresearch/learned-planner (JAX)

Taxonomy

Topics: Reinforcement Learning in Robotics · AI-based Problem Solving and Planning · Artificial Intelligence in Games