Explainable Reinforcement Learning Agents Using World Models

Madhuri Singh; Amal Alabdulkarim; Gennie Mansi; Mark O. Riedl

arXiv:2505.08073 · cs.AI · August 19, 2025

TL;DR

This paper introduces a novel explainability method for model-based deep reinforcement learning agents using World Models and Reverse World Models to generate counterfactual explanations, enhancing user understanding of agent behavior.

Contribution

The paper presents a new approach combining World Models and Reverse World Models to produce interpretable counterfactual explanations for reinforcement learning agents.
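
A World Model predicts how the world will change when an action is performed, so an agent can imagine a counterfactual trajectory instead of executing it. The following is a minimal sketch under stated assumptions, not the authors' implementation. It uses a hypothetical deterministic 5x5 gridworld in which `world_model`, `agent_policy`, `rollout`, `GOAL`, and `horizon` are illustrative names. The paper's agents use learned World Models, not a hand-written transition function. The factual trajectory follows the agent's own first action. The counterfactual trajectory substitutes a user-proposed first action and then lets the agent's policy continue. Every step is predicted by the model rather than taken in the environment.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]  # hypothetical (x, y) position on a small grid
Action = str

GRID = 5
GOAL: State = (4, 4)
MOVES: Dict[Action, Tuple[int, int]] = {
    "up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0),
}


def world_model(state: State, action: Action) -> State:
    """Toy stand-in for a learned World Model: predict the next state.

    Moves that would leave the grid keep the agent where it is.
    """
    dx, dy = MOVES[action]
    x, y = state[0] + dx, state[1] + dy
    return (x, y) if 0 <= x < GRID and 0 <= y < GRID else state


def agent_policy(state: State) -> Action:
    """Hypothetical agent policy: move right until the goal column, then up."""
    return "right" if state[0] < GOAL[0] else "up"


def rollout(start: State, first_action: Action,
            policy: Callable[[State], Action],
            model: Callable[[State, Action], State],
            horizon: int) -> List[State]:
    """Imagine a trajectory: take first_action, then follow the policy.

    Only the model is queried; nothing is executed in a real environment.
    """
    states = [start]
    action = first_action
    for _ in range(horizon):
        states.append(model(states[-1], action))
        action = policy(states[-1])
    return states


start: State = (0, 0)
factual = rollout(start, agent_policy(start), agent_policy, world_model, horizon=3)
counterfactual = rollout(start, "up", agent_policy, world_model, horizon=3)
print(factual)         # [(0, 0), (1, 0), (2, 0), (3, 0)]
print(counterfactual)  # [(0, 0), (0, 1), (1, 1), (2, 1)]
```

Showing the two imagined trajectories side by side tells a user *what* would have happened under their preferred action. As the abstract notes, this alone does not explain *why* the agent chose differently.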

Findings

01. Explanations improve user understanding of agent policies.

02. Counterfactual trajectories help users grasp why agents make certain decisions.

03. The method enhances transparency in sequential decision-making processes.

Abstract

Explainable AI (XAI) systems have been proposed to help people understand how AI systems produce outputs and behaviors. Explainable Reinforcement Learning (XRL) has an added complexity due to the temporal nature of sequential decision-making. Further, non-AI experts do not necessarily have the ability to alter an agent or its policy. We introduce a technique for using World Models to generate explanations for Model-Based Deep RL agents. World Models predict how the world will change when actions are performed, allowing for the generation of counterfactual trajectories. However, identifying what a user wanted the agent to do is not enough to understand why the agent did something else. We augment Model-Based RL agents with a Reverse World Model, which predicts what the state of the world should have been for the agent to prefer a given counterfactual action. We show that explanations…
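
The abstract describes a Reverse World Model that predicts what the state of the world should have been for the agent to prefer a given counterfactual action. The paper learns this model. The sketch below swaps in a much simpler technique: a nearest-neighbor lookup over states the agent has already visited. It is meant only to illustrate the input/output contract (current state plus desired action in, "required" state out). Several pieces are hypothetical: the two-feature state (`distance`, `health`), the `agent_policy` rule, the `"attack"`/`"retreat"` actions, and the helpers `reverse_model_nn` and `explain_difference`.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

State = Tuple[float, ...]
Policy = Callable[[State], str]


def agent_policy(state: State) -> str:
    """Hypothetical trained policy: attack only when healthy and close to the enemy."""
    distance, health = state
    return "attack" if health >= 50 and distance <= 3 else "retreat"


def reverse_model_nn(current: State, desired_action: str,
                     visited: Sequence[State], policy: Policy) -> Optional[State]:
    """Nearest-neighbor stand-in for a learned Reverse World Model.

    Returns the visited state closest to `current` (Euclidean distance) in which
    `policy` would choose `desired_action`, or None if no such state was seen.
    """
    candidates = [s for s in visited if policy(s) == desired_action]
    if not candidates:
        return None
    return min(candidates, key=lambda s: math.dist(s, current))


def explain_difference(current: State, target: State,
                       names: Sequence[str]) -> Dict[str, Tuple[float, float]]:
    """List the features that would have had to differ, as (actual, required) pairs."""
    return {n: (c, t) for n, c, t in zip(names, current, target) if c != t}


visited: List[State] = [(2.0, 30.0), (5.0, 80.0), (2.0, 60.0), (1.0, 90.0), (6.0, 20.0)]
current: State = (2.0, 30.0)
print(agent_policy(current))  # retreat

# The user asks: "why didn't the agent attack?"
target = reverse_model_nn(current, "attack", visited, agent_policy)
print(target)  # (2.0, 60.0)
if target is not None:
    print(explain_difference(current, target, ["distance", "health"]))
    # {'health': (30.0, 60.0)}
```

Read this way, the explanation becomes "the agent would have attacked if its health had been 60 rather than 30." Euclidean distance over raw, unnormalized features is a deliberate simplification here. A learned Reverse World Model would generalize beyond states already in the buffer.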