Mechanistic Interpretability of Reinforcement Learning Agents
Tristan Trim, Triston Grayston

TL;DR
This paper investigates how reinforcement learning agents make decisions by analyzing a neural network trained on maze tasks, identifying internal features and biases, and developing tools for interpretability.
Contribution
It introduces methods for dissecting RL neural networks, identifies biases like goal misgeneralization, and develops interactive tools for layer activation exploration.
Findings
Identified fundamental features like maze walls and pathways in neural networks.
Discovered biases such as consistently moving towards the top right corner, even in the absence of explicit goals.
Developed tools for visualizing and exploring layer activations.
Abstract
This paper explores the mechanistic interpretability of reinforcement learning (RL) agents through an analysis of a neural network trained on procedural maze environments. By dissecting the network's inner workings, we identified fundamental features like maze walls and pathways, which form the basis of the model's decision-making process. A significant observation was goal misgeneralization, in which the RL agent developed biases towards certain navigation strategies, such as consistently moving towards the top right corner, even in the absence of explicit goals. Using techniques like saliency mapping and feature mapping, we visualized these biases. We extended this exploration by developing novel tools for interactively exploring layer activations.
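To make the saliency mapping mentioned in the abstract concrete, here is a minimal sketch of the general idea: measure how strongly each maze cell influences the logit of the action the policy prefers. Everything in it is an assumption for illustration and not the authors' code. The network is a randomly initialised two-layer toy policy rather than the trained agent, the names `toy_policy_logits` and `saliency_map` are hypothetical, the wall/open encoding is made up, and gradients are estimated by finite differences instead of backpropagation.

```python
import numpy as np

def toy_policy_logits(obs, w1, w2):
    """Hypothetical two-layer policy: flattened grid -> tanh hidden layer -> 4 action logits."""
    hidden = np.tanh(obs.ravel() @ w1)
    return hidden @ w2

def saliency_map(obs, w1, w2, action, eps=1e-4):
    """Central finite-difference estimate of |d logit[action] / d obs[i, j]| per cell."""
    sal = np.zeros(obs.shape, dtype=float)
    for idx in np.ndindex(obs.shape):
        plus = obs.astype(float).copy()
        minus = plus.copy()
        plus[idx] += eps
        minus[idx] -= eps
        diff = (toy_policy_logits(plus, w1, w2)[action]
                - toy_policy_logits(minus, w1, w2)[action])
        sal[idx] = abs(diff) / (2 * eps)
    return sal

# Made-up 5x5 maze: 1.0 = wall, 0.0 = open cell (an assumed encoding).
rng = np.random.default_rng(0)
maze = rng.integers(0, 2, size=(5, 5)).astype(float)

# Random weights stand in for a trained network.
w1 = rng.normal(scale=0.3, size=(25, 8))
w2 = rng.normal(scale=0.3, size=(8, 4))

# Saliency with respect to the action the toy policy currently prefers.
chosen = int(np.argmax(toy_policy_logits(maze, w1, w2)))
sal = saliency_map(maze, w1, w2, chosen)
print(np.round(sal, 3))
```

With a trained agent, cells with high saliency would be the ones the policy is most sensitive to. A bias such as the top right corner preference described above might then show up as consistently elevated saliency in that region, though this toy example cannot demonstrate that.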
Taxonomy
Topics: Statistical and Computational Modeling
