Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems
Risal Shahriar Shefin, Debashis Gupta, Thai Le, Sarra Alqahtani

TL;DR
This paper presents a gradient-based framework for interpretable failure detection and analysis in multi-agent reinforcement learning systems, enabling diagnosis of failure sources and propagation pathways.
Contribution
It introduces a novel two-stage gradient analysis method that provides interpretable diagnostics for failure detection and propagation in MARL systems.
Findings
Achieves 88.2-99.4% accuracy in Patient-0 detection
Provides geometric evidence for failure propagation pathways
Effective across multiple MARL environments
Abstract
Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives: first-order sensitivity and directional second-order curvature aggregated over causal windows to…
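The Stage 1 idea in the abstract (score each agent by a Taylor remainder, then flag the first agent whose score crosses a threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact cost function, expansion order, or threshold rule, so the first-order remainder, the finite-difference gradient, the toy quadratic cost, and the names `numerical_gradient`, `taylor_remainder`, and `first_crossing` are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def numerical_gradient(f: Callable[[List[float]], float],
                       x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Central-difference gradient of f at x (stand-in for autograd)."""
    grad = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def taylor_remainder(f: Callable[[List[float]], float],
                     x_ref: Sequence[float], x_new: Sequence[float]) -> float:
    """First-order remainder |f(x_new) - f(x_ref) - grad f(x_ref) . (x_new - x_ref)|.

    A large value means the cost moved in a way the local linear model
    around x_ref does not explain.
    """
    grad = numerical_gradient(f, x_ref)
    delta = [b - a for a, b in zip(x_ref, x_new)]
    linear = sum(g * d for g, d in zip(grad, delta))
    return abs(f(list(x_new)) - f(list(x_ref)) - linear)


def first_crossing(scores: Dict[str, Sequence[float]],
                   threshold: float) -> Optional[Tuple[str, int]]:
    """Return (agent, timestep) of the earliest score above threshold, or None.

    Ties at the same timestep go to the agent listed first (an assumption).
    """
    best: Optional[Tuple[str, int]] = None
    for agent, series in scores.items():
        for t, score in enumerate(series):
            if score > threshold:
                if best is None or t < best[1]:
                    best = (agent, t)
                break  # only the first crossing per agent matters
    return best


def toy_cost(x: List[float]) -> float:
    """Hypothetical quadratic cost standing in for a policy-gradient cost."""
    return x[0] ** 2 + x[1] ** 2


remainder = taylor_remainder(toy_cost, [1.0, 0.0], [1.5, 0.0])

# Hypothetical per-agent remainder scores over three timesteps.
candidate = first_crossing(
    {"agent_0": [0.01, 0.02, 0.5], "agent_1": [0.01, 0.3, 0.6]},
    threshold=0.1,
)
```

In this toy run `agent_1` crosses the threshold one step before `agent_0`, so it becomes the initial candidate. As the abstract notes, the first agent flagged need not be the attacked one, which is why the framework adds a second validation stage.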
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
