Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates
Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz

TL;DR
This paper introduces TRIRL, an inverse reinforcement learning (IRL) method that combines trust-region policy updates with explicit dual optimization to obtain monotonic policy improvement and stable reward learning.
Contribution
TRIRL bridges classical and adversarial IRL approaches by enabling stable, monotonic reward and policy updates without fully solving an RL problem at every iteration.
Findings
TRIRL outperforms state-of-the-art imitation learning methods by 2.4x in aggregate interquartile mean.
TRIRL learns reward functions that generalize across system dynamics shifts.
The method avoids the training instabilities of adversarial approaches.
Abstract
Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly…
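For orientation, the classical dual-ascent baseline that the abstract contrasts against can be sketched on a toy problem. This is not TRIRL and not the authors' code: it is a minimal illustration of maximum-entropy IRL with a per-state reward, assuming a hypothetical 5-state deterministic chain MDP, a finite horizon, and expert data generated from a known reward. All function names (`chain_mdp`, `soft_value_iteration`, `state_visitation`, `dual_ascent_irl`) and all hyperparameters are made up for the example. The point it shows is the cost the abstract mentions: every dual gradient step requires a full soft-RL solve.

```python
import numpy as np

def chain_mdp(n_states=5):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, n_states - 1)] = 1.0
    return P

def soft_value_iteration(P, r, horizon):
    """Fully solve the entropy-regularized RL problem for state reward r."""
    V = np.zeros(P.shape[0])
    policies = []
    for _ in range(horizon):
        Q = r[:, None] + P @ V                      # shape (S, A)
        m = Q.max(axis=1, keepdims=True)
        V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
        policies.append(np.exp(Q - V[:, None]))     # softmax policy over actions
    return policies[::-1]                           # policies[t] acts at time step t

def state_visitation(P, policies, mu0):
    """Time-averaged state distribution induced by the time-dependent policy."""
    d = mu0.copy()
    total = np.zeros_like(mu0)
    for pi in policies:
        total += d
        d = np.einsum("s,sa,sap->p", d, pi, P)
    return total / len(policies)

def dual_ascent_irl(P, mu0, rho_expert, horizon, lr=0.1, n_iters=300):
    """Classical dual ascent: one full RL solve per reward update."""
    theta = np.zeros(P.shape[0])                    # reward parameters (dual variables)
    errors = []
    for _ in range(n_iters):
        policies = soft_value_iteration(P, theta, horizon)
        rho = state_visitation(P, policies, mu0)
        grad = rho_expert - rho                     # dual gradient: distribution mismatch
        errors.append(float(np.linalg.norm(grad)))
        theta += lr * grad
    return theta, errors

# Toy setup (all values are assumptions for illustration).
P = chain_mdp()
mu0 = np.eye(5)[0]                                  # always start in state 0
horizon = 10
true_reward = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
rho_expert = state_visitation(P, soft_value_iteration(P, true_reward, horizon), mu0)

theta, errors = dual_ascent_irl(P, mu0, rho_expert, horizon)
print("mismatch before:", errors[0], "after:", errors[-1])
```

In this sketch the inner call to `soft_value_iteration` is cheap only because the MDP is tiny and tabular; in larger problems that inner solve is the expense that, per the abstract, adversarial methods and TRIRL aim to avoid.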
