A Survey on Explainable Deep Reinforcement Learning
Zelei Cheng, Jiahao Yu, Xinyu Xing

TL;DR
This survey reviews explainable deep reinforcement learning methods, their evaluation, and integration with large language models, aiming to improve transparency, trust, and safety in AI decision-making systems.
Contribution
It provides a comprehensive overview of XRL techniques and their assessment frameworks, and explores the integration of RL with LLMs, particularly through RLHF, for better AI alignment.
Findings
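XRL enhances transparency at the feature, state, dataset, and model levels
XRL methods are assessed with both qualitative and quantitative evaluation frameworks
Integrating RL with LLMs, particularly through RLHF, optimizes AI alignment with human preferences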
Abstract
Deep Reinforcement Learning (DRL) has achieved remarkable success in sequential decision-making tasks across diverse domains, yet its reliance on black-box neural architectures hinders interpretability, trust, and deployment in high-stakes applications. Explainable Deep Reinforcement Learning (XRL) addresses these challenges by enhancing transparency through feature-level, state-level, dataset-level, and model-level explanation techniques. This survey provides a comprehensive review of XRL methods, evaluates their qualitative and quantitative assessment frameworks, and explores their role in policy refinement, adversarial robustness, and security. Additionally, we examine the integration of reinforcement learning with Large Language Models (LLMs), particularly through Reinforcement Learning from Human Feedback (RLHF), which optimizes AI alignment with human preferences. We conclude by…
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare
