Perspectives for Direct Interpretability in Multi-Agent Deep Reinforcement Learning
Yoann Poupart, Aurélie Beynier, Nicolas Maudet

TL;DR
This paper explores methods for directly interpreting multi-agent deep reinforcement learning models post-training, providing insights into agent behavior and emergent phenomena without modifying the models.
Contribution
It advocates for a scalable, model-agnostic approach to interpretability in MADRL using post hoc explanation techniques and discusses future research directions.
Findings
Relevance backpropagation and other methods can explain agent decisions.
Post hoc explanations reveal emergent behaviors and biases.
Direct interpretability methods can improve understanding of multi-agent coordination.
Abstract
Multi-Agent Deep Reinforcement Learning (MADRL) has proven effective at solving complex problems in robotics and games, yet most trained models are hard to interpret. While learning intrinsically interpretable models remains a prominent approach, its scalability and flexibility are limited when handling complex tasks or multi-agent dynamics. This paper advocates for direct interpretability, generating post hoc explanations directly from trained models, as a versatile and scalable alternative that offers insights into agents' behaviour, emergent phenomena, and biases without altering the models' architectures. We explore modern methods, including relevance backpropagation, knowledge editing, model steering, activation patching, sparse autoencoders and circuit discovery, to highlight their applicability to single-agent, multi-agent, and training process challenges. By addressing MADRL…
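As an illustration of one method named in the abstract, the sketch below applies activation patching to a toy policy network. It is not taken from the paper: the two-layer NumPy policy, its random weights, and the helper names (`init_policy`, `forward`, `patching_effects`) are all hypothetical stand-ins for a trained agent. The idea is to run the policy on a "corrupted" observation, overwrite one hidden unit with its value from a "clean" observation, and measure how much the logit of a chosen action moves.

```python
import numpy as np


def init_policy(rng, obs_dim=4, hidden_dim=8, n_actions=3):
    # Hypothetical toy policy with random weights; stands in for a trained agent.
    return {
        "W1": rng.normal(size=(obs_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(size=(hidden_dim, n_actions)),
        "b2": np.zeros(n_actions),
    }


def forward(params, obs, patch=None):
    """Return (hidden, logits); `patch` is an optional (indices, values)
    pair that overwrites the selected hidden units before the output layer."""
    hidden = np.maximum(0.0, obs @ params["W1"] + params["b1"])  # ReLU layer
    if patch is not None:
        idx, values = patch
        hidden = hidden.copy()
        hidden[idx] = values
    logits = hidden @ params["W2"] + params["b2"]
    return hidden, logits


def patching_effects(params, clean_obs, corrupted_obs, action):
    """For each hidden unit, patch its clean activation into the corrupted
    run and record the resulting change in the logit of `action`."""
    clean_hidden, clean_logits = forward(params, clean_obs)
    _, corrupted_logits = forward(params, corrupted_obs)
    effects = np.zeros(clean_hidden.shape[0])
    for i in range(clean_hidden.shape[0]):
        _, patched_logits = forward(
            params, corrupted_obs, patch=([i], clean_hidden[[i]])
        )
        effects[i] = patched_logits[action] - corrupted_logits[action]
    return effects, clean_logits, corrupted_logits


rng = np.random.default_rng(0)
params = init_policy(rng)
clean_obs = rng.normal(size=4)
corrupted_obs = rng.normal(size=4)
effects, clean_logits, corrupted_logits = patching_effects(
    params, clean_obs, corrupted_obs, action=1
)
# Units with the largest absolute effect matter most for this action.
ranking = np.argsort(-np.abs(effects))
```

Because the patched layer feeds a linear output layer in this toy setup, the per-unit effects sum exactly to the clean-minus-corrupted logit gap. In deeper networks that additivity does not hold in general, and the per-unit effects should be read as approximate attributions.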
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
