Inducing, Detecting and Characterising Neural Modules: A Pipeline for Functional Interpretability in Reinforcement Learning

Anna Soligo; Pietro Ferraro; David Boyle

arXiv:2501.17077·cs.LG·June 3, 2025

Open Access

TL;DR

This paper proposes interpreting reinforcement learning policies at the level of functional modules rather than individual neurons. Encouraging sparsity and locality in network weights induces the modules, an extended Louvain algorithm with a correlation alignment metric detects them, and the approach is applied to 2D and 3D MiniGrid environments.

Contribution

It proposes a modularity-based interpretability method for RL policy networks as an alternative to neuron-level analysis, which scales poorly to large models, and introduces a 'correlation alignment' metric for module detection.

Findings

01. Functional modules emerge in RL networks with sparsity and locality constraints.

02. Distinct navigational modules correspond to different axes in MiniGrid environments.

03. Interventions on network weights validate the functional roles of identified modules.

Abstract

Interpretability is crucial for ensuring RL systems align with human values. However, it remains challenging to achieve in complex decision-making domains. Existing methods frequently attempt interpretability at the level of fundamental model units, such as neurons or decision nodes, an approach which scales poorly to large models. Here, we instead propose an approach to interpretability at the level of functional modularity. We show how encouraging sparsity and locality in network weights leads to the emergence of functional modules in RL policy networks. To detect these modules, we develop an extended Louvain algorithm which uses a novel 'correlation alignment' metric to overcome the limitations of standard network analysis techniques when applied to neural network architectures. Applying these methods to 2D and 3D MiniGrid environments reveals the consistent emergence of distinct…
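To make "encouraging sparsity and locality in network weights" concrete, here is a minimal sketch of one possible regulariser: an L1 penalty in which each weight is scaled by the distance between the two neurons it connects. This is an illustrative assumption, not the loss used in the paper, which this page does not state. The function name `locality_weighted_l1`, the one-dimensional neuron positions, and the `layer_gap` parameter are all hypothetical.

```python
from typing import List


def locality_weighted_l1(
    weights: List[List[float]],
    in_pos: List[float],
    out_pos: List[float],
    layer_gap: float = 1.0,
) -> float:
    """Return sum over i, j of |w[i][j]| * d[i][j] for one layer.

    Assumed layout (hypothetical): every neuron sits at a 1D position, and
    the distance between output neuron i and input neuron j is the fixed gap
    between layers plus their positional offset.
    """
    penalty = 0.0
    for i, row in enumerate(weights):      # i indexes output neurons
        for j, w in enumerate(row):        # j indexes input neurons
            distance = layer_gap + abs(out_pos[i] - in_pos[j])
            penalty += abs(w) * distance   # L1 term scaled by connection length
    return penalty


# Two inputs and two outputs placed at positions 0 and 1.
positions = [0.0, 1.0]
local = [[1.0, 0.0], [0.0, 1.0]]    # each output reads its nearest input
crossed = [[0.0, 1.0], [1.0, 0.0]]  # each output reads the far input

local_penalty = locality_weighted_l1(local, positions, positions)
crossed_penalty = locality_weighted_l1(crossed, positions, positions)
```

Both weight matrices have the same plain L1 norm, but the crossed wiring is charged more because its connections are longer. Adding a term like this to the training loss would push a network toward few, short connections, which is the kind of pressure under which spatially grouped modules could form.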

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Anomaly Detection Techniques and Applications

Methods: ALIGN