Unveiling the Black Box: A Multi-Layer Framework for Explaining Reinforcement Learning-Based Cyber Agents

Diksha Goel, Kristen Moore, Jeff Wang, Minjune Kim, and Thanh Thi Nguyen

arXiv:2505.11708 · cs.CR · May 18, 2026

TL;DR

This paper introduces a multi-layer explainability framework for reinforcement learning-based cyber agents. The framework exposes the agents' strategic and tactical reasoning to improve trust, debugging, and threat understanding in cybersecurity.

Contribution

It presents a unified, domain-agnostic explainability approach that models cyberattacks as POMDPs and analyzes policy evolution, surpassing prior post-hoc methods.

Findings

1. The framework provides interpretable insights across environments of increasing complexity.

2. It reveals exploration-exploitation dynamics and behavioral shifts.

3. It supports use cases such as threat modeling and RL policy debugging.

Abstract

Reinforcement Learning (RL) agents are increasingly used to simulate sophisticated cyberattacks, but their decision-making processes remain opaque, hindering trust, debugging, and defensive preparedness. In high-stakes cybersecurity contexts, explainability is essential for understanding how adversarial strategies are formed and evolve over time. In this paper, we propose a unified, multi-layer explainability framework for RL-based attacker agents that reveals both strategic (Markov Decision Process (MDP)-level) and tactical (policy-level) reasoning. At the MDP-level, we model cyberattacks as a Partially Observable Markov Decision Process (POMDP) to expose exploration-exploitation dynamics and phase-aware behavioural shifts. At the policy-level, we analyse the temporal evolution of Q-values and use Prioritised Experience Replay (PER) to surface critical learning transitions and evolving…
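To make the policy-level idea concrete, here is a minimal sketch of proportional prioritised replay, in which transitions with a large temporal-difference (TD) error receive a higher sampling priority and can therefore be surfaced as candidate "critical" learning moments. It is an illustration under stated assumptions, not the authors' implementation: the class name `PrioritisedReplay`, the `top` helper, the `alpha` and `eps` defaults, and the toy transitions are all hypothetical.

```python
import random


class PrioritisedReplay:
    """Proportional prioritised replay buffer (illustrative sketch only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha      # 0 gives uniform sampling; larger values skew toward high error
        self.eps = eps          # keeps zero-error transitions sampleable
        self.items = []         # stored transitions, oldest first
        self.priorities = []    # one priority per stored transition

    def add(self, transition, td_error):
        # Evict the oldest transition once the buffer is full.
        if len(self.items) >= self.capacity:
            self.items.pop(0)
            self.priorities.pop(0)
        self.items.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k, rng=random):
        # Draw k transitions with probability proportional to priority.
        return rng.choices(self.items, weights=self.priorities, k=k)

    def top(self, n):
        # Return the n highest-priority transitions, for inspection by an analyst.
        ranked = sorted(zip(self.priorities, range(len(self.items))), reverse=True)
        return [self.items[i] for _, i in ranked[:n]]


# Hypothetical (state, action, reward, next_state) tuples for a toy attack episode.
buf = PrioritisedReplay(capacity=3)
buf.add(("s0", "scan", 0.0, "s1"), td_error=0.1)
buf.add(("s1", "exploit", 10.0, "s2"), td_error=5.0)
buf.add(("s2", "noop", 0.0, "s2"), td_error=0.0)
most_surprising = buf.top(1)
```

In this toy run, `most_surprising` holds the `exploit` transition, because it has the largest TD error. Ranking by priority is only one possible way to pick out transitions for explanation; the truncated abstract does not say how the paper selects them.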

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Information and Cyber Security

Methods: Experience Replay