Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning
Sean Xie, Soroush Vosoughi, Saeed Hassanpour

TL;DR
This paper introduces a novel framework using Adversarial Inverse Reinforcement Learning to provide global explanations for deep reinforcement learning models, enhancing interpretability by summarizing their decision-making processes.
Contribution
It presents a new approach that leverages inverse reinforcement learning to interpret and explain the behavior of deep reinforcement learning models globally.
Findings
Provides global explanations for RL decisions
Captures intuitive tendencies of models
Enhances interpretability of deep RL models
Abstract
Artificial intelligence, particularly through recent advancements in deep learning, has achieved exceptional performance in many tasks in fields such as natural language processing and computer vision. In addition to desirable evaluation metrics, a high level of interpretability is often required for these models to be reliably utilized. Therefore, explanations that offer insight into the process by which a model maps its inputs onto its outputs are much sought-after. Unfortunately, the black-box nature of current machine learning models remains an unresolved issue, and it prevents researchers from learning and providing explicative descriptions of a model's behavior and final predictions. In this work, we propose a novel framework utilizing Adversarial Inverse Reinforcement Learning that can provide global explanations for decisions made by a Reinforcement Learning…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
