Compositional Concept-Based Neuron-Level Interpretability for Deep Reinforcement Learning
Zeyu Jiang, Hai Huang, Xingquan Zuo

TL;DR
This paper introduces a neuron-level interpretability method for deep reinforcement learning models, using concept-based explanations to improve transparency and align neural activations with human-understandable concepts.
Contribution
It formalizes atomic concepts as binary functions over the state space, composes them into complex concepts through logical operations, and relates neuron activations to these concepts to expose the internal decision mechanisms of DRL policy/value networks.
Findings
Effectively identifies meaningful concepts in DRL models
Aligns neuron activations with human-understandable concepts
Provides faithful explanations of network decision-making
Abstract
Deep reinforcement learning (DRL), through learning policies or values represented by neural networks, has successfully addressed many complex control problems. However, the neural networks introduced by DRL lack interpretability and transparency. Current DRL interpretability methods largely treat neural networks as black boxes, with few approaches delving into the internal mechanisms of policy/value networks. This limitation undermines trust in both the neural network models that represent policies and the explanations derived from them. In this work, we propose a novel concept-based interpretability method that provides fine-grained explanations of DRL models at the neuron level. Our method formalizes atomic concepts as binary functions over the state space and constructs complex concepts through logical operations. By analyzing the correspondence between neuron activations and…
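To make the idea of atomic concepts and logical composition concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the toy states, the concept names (pos_right, moving_right), the activation threshold of 0.5, and the use of intersection-over-union (IoU) as the alignment score are all hypothetical choices, since the abstract as shown here is truncated before it names the actual correspondence measure.

```python
import numpy as np

# Hypothetical toy states: column 0 = position, column 1 = velocity.
states = np.array([
    [-0.5,  0.2],
    [ 0.3,  0.1],
    [ 0.8, -0.4],
    [-0.1, -0.3],
])

# Atomic concepts: binary functions over the state space (names are assumptions).
def pos_right(s):
    return s[:, 0] > 0.0

def moving_right(s):
    return s[:, 1] > 0.0

# Complex concepts are built from atomic ones with logical operations.
def AND(c1, c2):
    return lambda s: c1(s) & c2(s)

def OR(c1, c2):
    return lambda s: c1(s) | c2(s)

def NOT(c):
    return lambda s: ~c(s)

def iou(a, b):
    """Intersection-over-union of two boolean masks (assumed alignment score)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

# Hypothetical activations of one neuron on the four states, binarized at 0.5.
activations = np.array([0.0, 0.9, 0.1, 0.0])
neuron_mask = activations > 0.5

candidates = {
    "pos_right": pos_right,
    "moving_right": moving_right,
    "pos_right AND moving_right": AND(pos_right, moving_right),
    "pos_right OR moving_right": OR(pos_right, moving_right),
    "NOT pos_right": NOT(pos_right),
}

# Score every candidate concept against the neuron and keep the best match.
scores = {name: iou(neuron_mask, concept(states)) for name, concept in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy setup the neuron fires only on the state that is both to the right and moving right, so the conjunction scores highest. A real analysis would draw states from rollouts of the trained agent and search a much larger space of composed concepts.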
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Fuzzy Logic and Control Systems · Adversarial Robustness in Machine Learning
Methods: ALIGN
