TL;DR
This paper introduces a causal modeling framework for explaining reinforcement learning policies by learning simplified high-level causal models that respond accurately to interventions, helping to understand policy successes and failures.
Contribution
It presents a novel nonlinear causal model reduction method that ensures approximate interventional consistency, providing meaningful explanations of complex RL policies.
Findings
Successfully applied to synthetic and real RL tasks
Uncovered behavioral patterns and biases in policies
Identified failure modes in RL policies
Abstract
Why do reinforcement learning (RL) policies fail or succeed? This is a challenging question due to the complex, high-dimensional nature of agent-environment interactions. In this work, we take a causal perspective on explaining the behavior of RL policies by viewing the states, actions, and rewards as variables in a low-level causal model. We introduce random perturbations to policy actions during execution and observe their effects on the cumulative reward, learning a simplified high-level causal model that explains these relationships. To this end, we develop a nonlinear Causal Model Reduction framework that ensures approximate interventional consistency, meaning the simplified high-level model responds to interventions in a similar way as the original complex system. We prove that for a class of nonlinear causal models, there exists a unique solution that achieves exact…
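The perturb-and-observe idea in the abstract can be illustrated with a small sketch. It is not the paper's nonlinear Causal Model Reduction: a plain linear least-squares fit is swapped in as the simplest possible high-level summary. The toy 1-D environment, the linear policy, and the names `rollout` and `fit_linear_summary` are all hypothetical, chosen only for illustration.

```python
import numpy as np

def rollout(shifts, rng, horizon=20):
    """Run one toy 1-D episode; shifts[t] is added to the policy action at step t."""
    x = 1.0  # hypothetical initial state
    total_reward = 0.0
    for t in range(horizon):
        action = -0.5 * x              # hypothetical linear policy (assumption)
        action += shifts[t]            # shift intervention on the action
        x = x + action + 0.01 * rng.normal()  # toy dynamics with small noise
        total_reward += -x ** 2        # reward penalizes distance from the origin
    return total_reward

def fit_linear_summary(n_episodes=2000, horizon=20, scale=0.1, seed=0):
    """Regress the episodic return on random per-step action shifts.

    Linear stand-in for a high-level model (assumption): weights[t]
    estimates how a small shift at step t changes the cumulative reward.
    """
    rng = np.random.default_rng(seed)
    shifts = scale * rng.normal(size=(n_episodes, horizon))
    returns = np.array([rollout(s, rng, horizon) for s in shifts])
    # Least squares: returns ~ bias + shifts @ weights
    design = np.hstack([np.ones((n_episodes, 1)), shifts])
    coef, *_ = np.linalg.lstsq(design, returns, rcond=None)
    return coef[0], coef[1:]

if __name__ == "__main__":
    bias, weights = fit_linear_summary()
    print("estimated effect of a shift at step 0:", weights[0])
    print("estimated effect of a shift at the last step:", weights[-1])
```

In this toy setup the early-step weights come out clearly negative while the late-step weights are close to zero, because the state has already decayed toward the origin by the end of the episode. A weight profile of this kind is the sort of per-step attribution a high-level model is meant to expose, here under much stronger simplifying assumptions than the paper makes.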
Peer Reviews
Decision: ICLR 2026 Poster
1. **Bridging causality and RL interpretability**: It connects causal abstraction theory with reinforcement learning policy analysis (a growing but underexplored intersection) and provides a unified, formal framework to derive explanations that are causally meaningful rather than purely correlational.
2. **Conceptually grounded contribution**: The paper builds upon rigorous causal inference foundations (structural causal models, causal abstraction, targeted reductions) and extends them into nonli
1. **Dependence on interventional assumptions**: The method relies on shift interventions on continuous actions and Gaussianity assumptions at the high level, limiting its applicability to discrete or stochastic policy settings. Future work directions mention this but downplay its importance.
2a. **Interpretability trade-off**: Although Gaussian kernel maps improve interpretability, the framework still produces dense, high-dimensional weight maps that may be difficult to interpret without signi
The problem addressed is highly important for RL. Framing it from the perspective of policy-level decision-making explainability is both interesting and technically challenging, as it requires linking internal policy mechanisms to long-term performance. Advancing this line of work could meaningfully improve the interpretability, debugging, and reliability of RL systems in complex settings. The paper is well written and easy to follow, and the theoretical part appears strong. The core idea is c
The application domain of the method is not fully clear. Please articulate the types of scenarios where it is intended to be used and detail the constraints introduced by the surjectivity assumption. As a concrete edge case: if the long-term (episodic) return is binary—i.e., a single scalar only at the final step—does the method still apply, and under what conditions could the method be applied? Additionally, for readability, it would be helpful to include a worked example in the introduction
- The problem is well motivated and relevant.
- The theoretical contribution seems rigorous.
- The writing in the paper is quite dense in several areas, specifically in Sections 4 and 5.
- Lack of comparison with similar approaches for causal reduction.
- Evaluation is limited to specific use cases/configurations.
