OPIRL: Sample Efficient Off-Policy Inverse Reinforcement Learning via Distribution Matching
Hana Hoshino, Kei Ota, Asako Kanezaki, Rio Yokota

TL;DR
OPIRL introduces a sample-efficient off-policy IRL method that reduces environment interactions, learns transferable reward functions, and generalizes well across different environments and tasks.
Contribution
The paper proposes OPIRL, a novel off-policy IRL algorithm that improves sample efficiency and reward generalization compared to prior on-policy methods.
Findings
Needs significantly fewer environment interactions than prior on-policy IRL methods
Achieves comparable or better policy performance
Learned reward functions generalize across tasks and changing dynamics
Abstract
Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts an off-policy data distribution instead of an on-policy one, enabling a significant reduction in the number of interactions with the environment, (2) learns a stationary reward function that is transferable and generalizes well under changing dynamics, and (3) leverages mode-covering behavior for faster convergence. We demonstrate that our method is considerably more sample efficient and generalizes to novel environments…
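The abstract does not state the training objective, so the following is only a rough illustration of the general idea of learning a reward from off-policy data by distribution matching, not the OPIRL algorithm itself. It assumes a simple logistic-regression discriminator and a FIFO replay buffer; the names `ReplayBuffer`, `train_discriminator`, and `reward` are hypothetical. The point it shows is that the discriminator's "policy" samples come from a buffer of transitions gathered by all past policies, so old interactions are reused instead of being thrown away after each policy update.

```python
import numpy as np


class ReplayBuffer:
    """Fixed-capacity FIFO store of state-action vectors from all past policies."""

    def __init__(self, capacity, dim):
        self.data = np.zeros((capacity, dim))
        self.capacity = capacity
        self.size = 0
        self.ptr = 0

    def add(self, x):
        # Overwrite the oldest entry once the buffer is full.
        self.data[self.ptr] = x
        self.ptr = (self.ptr + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, n, rng):
        idx = rng.integers(0, self.size, size=n)
        return self.data[idx]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_discriminator(expert, buffer, rng, steps=500, batch=64, lr=0.1):
    """Fit a linear discriminator D(x) = sigmoid(w.x + b), expert vs. buffer.

    Assumption for illustration: plain gradient ascent on
    E_expert[log D(x)] + E_buffer[log(1 - D(x))].
    """
    w = np.zeros(expert.shape[1])
    b = 0.0
    for _ in range(steps):
        xe = expert[rng.integers(0, len(expert), size=batch)]
        xp = buffer.sample(batch, rng)  # off-policy: drawn from stored data
        pe = sigmoid(xe @ w + b)
        pp = sigmoid(xp @ w + b)
        grad_w = ((1.0 - pe)[:, None] * xe).mean(axis=0) - (pp[:, None] * xp).mean(axis=0)
        grad_b = (1.0 - pe).mean() - pp.mean()
        w += lr * grad_w
        b += lr * grad_b
    return w, b


def reward(x, w, b):
    """Reward as the discriminator logit, log D(x) - log(1 - D(x))."""
    return x @ w + b


# Toy usage: expert data near (+2, +2), buffer data near (-2, -2).
rng = np.random.default_rng(0)
buf = ReplayBuffer(capacity=1000, dim=2)
for _ in range(1000):
    buf.add(rng.normal(-2.0, 0.5, size=2))
expert_data = rng.normal(2.0, 0.5, size=(500, 2))
w, b = train_discriminator(expert_data, buf, rng)
print(reward(np.array([2.0, 2.0]), w, b), reward(np.array([-2.0, -2.0]), w, b))
```

In this toy setting the learned reward should score expert-like inputs higher than buffer-like ones. A full method would also need a policy-improvement step and, typically, some correction for the mismatch between the buffer distribution and the current policy, both of which are omitted here.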
Taxonomy
Topics: Reinforcement Learning in Robotics · Fuel Cells and Related Materials
