A Theoretical Framework for Explaining Reinforcement Learning with Shapley Values
Daniel Beechey, Thomas M. S. Smith, Özgür Şimşek

TL;DR
This paper introduces SVERL, a theoretical framework using Shapley values to explain reinforcement learning agents' behaviour, outcomes, and predictions with mathematically justified, interpretable explanations.
Contribution
The paper develops a unified, axiomatic approach to explaining RL agents through the influence of individual features, addressing gaps in interpretability and conceptual clarity in prior explanations.
Findings
SVERL provides precise, interpretable explanations of RL agents.
The framework identifies and corrects conceptual issues in prior explanations.
Illustrative examples demonstrate the usefulness of SVERL in understanding agent behaviour.
Abstract
Reinforcement learning agents can achieve super-human performance in complex decision-making tasks, but their behaviour is often difficult to understand and explain. This lack of explanation limits deployment, especially in safety-critical settings where understanding and trust are essential. We identify three core explanatory targets that together provide a comprehensive view of reinforcement learning agents: behaviour, outcomes, and predictions. We develop a unified theoretical framework for explaining these three elements of reinforcement learning agents through the influence of individual features that the agent observes in its environment. We derive feature influences by using Shapley values, which collectively and uniquely satisfy a set of well-motivated axioms for fair and consistent credit assignment. The proposed approach, Shapley Values for Explaining Reinforcement Learning…
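To make the idea of Shapley-value credit assignment concrete, here is a minimal sketch of the generic exact Shapley computation over a small feature set. It is not the paper's SVERL implementation: the function names `shapley_values` and `toy_value`, the feature names, and the toy value function are all hypothetical, chosen only for illustration. The characteristic functions that SVERL uses for behaviour, outcomes, and predictions are not given in this excerpt and are not reproduced here.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function `value` over `features`.

    `value` maps a frozenset of feature names to a number. The cost grows
    exponentially with the number of features, so this suits small examples only.
    """
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            # Standard Shapley weight for a coalition of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def toy_value(coalition):
    """Hypothetical value function (an assumption, not from the paper)."""
    v = 0.0
    if "x" in coalition:
        v += 1.0
    if "y" in coalition:
        v += 2.0
    if "x" in coalition and "y" in coalition:
        v += 1.0  # interaction term, shared between x and y
    return v


features = ["x", "y", "z"]
phi = shapley_values(features, toy_value)
for name in features:
    print(name, round(phi[name], 6))
```

In this toy case the interaction term is split equally between `x` and `y`, the unused feature `z` receives zero, and the values sum to the difference between the value of the full feature set and the value of the empty set. These are instances of the fairness and consistency axioms the abstract refers to.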
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning
Methods: ALIGN
