Mechanistic Interpretability of Reinforcement Learning Agents

Tristan Trim; Triston Grayston

arXiv:2411.00867·cs.LG·November 5, 2024

Open Access

TL;DR

This paper investigates how reinforcement learning agents make decisions by analyzing a neural network trained on procedural maze tasks. It reveals the network's internal features and learned biases, and presents tools for interpretability.

Contribution

It introduces methods for dissecting RL neural networks, identifies goal misgeneralization in the form of biased navigation strategies, and develops interactive tools for exploring layer activations.

Findings

01

Identified fundamental features, such as maze walls and pathways, in the network's inner workings.

02

Discovered biases towards certain navigation strategies, such as consistently moving towards the top-right corner, even in the absence of explicit goals.

03

Developed tools for visualizing and exploring layer activations.

Abstract

This paper explores the mechanistic interpretability of reinforcement learning (RL) agents through an analysis of a neural network trained on procedural maze environments. By dissecting the network's inner workings, we identified fundamental features like maze walls and pathways, which form the basis of the model's decision-making process. A significant observation was goal misgeneralization, where the RL agent developed biases towards certain navigation strategies, such as consistently moving towards the top-right corner, even in the absence of explicit goals. Using techniques like saliency mapping and feature mapping, we visualized these biases. We furthered this exploration by developing novel tools for interactively exploring layer activations.
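
To make the saliency mapping mentioned in the abstract more concrete, here is a minimal sketch of one common variant, finite-difference saliency, in Python with NumPy. It is not the authors' implementation. The linear `toy_policy_logits` function, the 5x5 grid, the wall/open encoding, and the four-action layout are all hypothetical stand-ins chosen so the example runs on its own. A real agent would use its trained network in place of the toy policy.

```python
import numpy as np

def toy_policy_logits(obs, weights):
    """Hypothetical stand-in for a trained policy: one linear layer over the flattened grid."""
    return weights @ obs.ravel()

def saliency_map(obs, weights, eps=1e-3):
    """Finite-difference saliency for the greedy action.

    Each cell is nudged by eps in turn, and the absolute change in the
    chosen action's logit (divided by eps) is recorded for that cell.
    """
    base = toy_policy_logits(obs, weights)
    action = int(np.argmax(base))
    sal = np.zeros(obs.shape, dtype=float)
    for idx in np.ndindex(obs.shape):
        bumped = obs.astype(float)  # astype returns a fresh copy
        bumped[idx] += eps
        delta = toy_policy_logits(bumped, weights)[action] - base[action]
        sal[idx] = abs(delta) / eps
    return action, sal

rng = np.random.default_rng(0)
obs = rng.integers(0, 2, size=(5, 5)).astype(float)  # assumed encoding: 1 = wall, 0 = open
weights = rng.normal(size=(4, 25))                   # assumed: 4 actions over 25 cells
action, sal = saliency_map(obs, weights)
print("greedy action index:", action)
print(np.round(sal, 2))
```

Cells with larger values in `sal` are those whose perturbation moves the chosen action's logit the most. For this linear toy policy the map simply recovers the magnitude of the weights for the chosen action; with a deep network the same loop (or a gradient computed by automatic differentiation) would give an input-dependent map.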

Taxonomy

Topics: Statistical and Computational Modeling