Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents
Diksha Goel, Kristen Moore, Jeff Wang, Minjune Kim, and Thanh Thi Nguyen

TL;DR
This paper introduces a multi-layer explainability framework for reinforcement learning cyber agents, revealing strategic and tactical reasoning to improve trust, debugging, and threat understanding in cybersecurity.
Contribution
It presents a unified, domain-agnostic explainability approach that models cyberattacks as POMDPs and analyzes policy evolution, surpassing prior post-hoc methods.
Findings
Provides interpretable insights across environments of increasing complexity.
Reveals exploration-exploitation dynamics and behavioral shifts.
Supports use cases like threat modeling and RL policy debugging.
Abstract
Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving…
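The abstract mentions using Prioritised Experience Replay (PER) to surface critical learning transitions. As a rough illustration only, the sketch below follows the standard proportional PER formulation, in which a transition's priority grows with the magnitude of its temporal-difference (TD) error. The function names, the `alpha`, `beta`, and `eps` defaults, and the idea of ranking transitions by |TD error| for inspection are assumptions made for this example; the excerpt does not specify how the authors compute or use priorities.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional PER sampling probabilities.

    Priority p_i = |delta_i| + eps, and P(i) = p_i**alpha / sum_k p_k**alpha.
    alpha = 0 gives uniform sampling; larger alpha favours high-error transitions.
    """
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """Importance-sampling weights w_i = (N * P(i))**(-beta), scaled so max(w) == 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]


def sample_batch(transitions, probs, batch_size, rng=random):
    """Draw a replay batch (with replacement) according to the PER probabilities."""
    return rng.choices(transitions, weights=probs, k=batch_size)


def top_transitions(transitions, td_errors, k=3):
    """Return the k transitions with the largest |TD error|, for manual inspection."""
    ranked = sorted(zip(transitions, td_errors),
                    key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical transition labels and TD errors, invented for illustration.
    transitions = ["scan_subnet", "exploit_host", "escalate_privilege", "idle"]
    td_errors = [0.1, -2.0, 0.5, 0.0]

    probs = per_probabilities(td_errors)
    weights = importance_weights(probs)
    for name, p, w in zip(transitions, probs, weights):
        print(f"{name}: P={p:.3f}, weight={w:.3f}")
    print(top_transitions(transitions, td_errors, k=2))
```

In this sketch, the transitions returned by `top_transitions` are the ones the learner found most surprising at that point in training, which is one plausible reading of "critical learning transitions".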
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Information and Cyber Security
Methods: Experience Replay
