Imitation Learning via Off-Policy Distribution Matching
Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

TL;DR
This paper introduces ValueDICE, an off-policy distribution matching method for imitation learning that improves data efficiency and eliminates the need for separate reinforcement learning steps, achieving state-of-the-art results.
Contribution
It transforms the distribution ratio estimation into an off-policy objective, enabling direct imitation policy learning without explicit rewards.
Findings
Achieves state-of-the-art sample efficiency in benchmarks.
Eliminates the need for separate RL optimization.
Demonstrates superior performance over existing methods.
Abstract
When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly data-inefficient or alter the original objective in a manner that can drastically change its optimum. In this work, we show how the original distribution ratio estimation objective may be transformed in a principled manner to yield a completely off-policy objective. In addition to the data-efficiency that this provides, we are able to show that this objective also renders the use of a separate RL optimization unnecessary. Rather, an imitation policy may be learned directly from this objective…
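As a rough illustration of what "estimating distribution ratios" can mean, the sketch below assumes the Donsker-Varadhan form of the KL divergence, KL(p || q) = max over x of E_p[x] - log E_q[exp(x)], whose maximizer is the log ratio log(p/q) up to an additive constant. The toy three-outcome distributions and the names `d_pi`, `d_exp`, and `dv_bound` are hypothetical and chosen only for this example. This is not the paper's off-policy objective, only the kind of ratio-estimation bound such an objective starts from.

```python
import numpy as np


def kl_divergence(p, q):
    """Exact KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))


def dv_bound(p, q, x):
    """Donsker-Varadhan lower bound on KL(p || q): E_p[x] - log E_q[exp(x)]."""
    return float(np.dot(p, x) - np.log(np.dot(q, np.exp(x))))


# Hypothetical toy occupancy distributions over three state-action pairs.
d_pi = np.array([0.5, 0.3, 0.2])   # stand-in for the imitation policy
d_exp = np.array([0.2, 0.3, 0.5])  # stand-in for the expert

# The bound is tight when x is the log distribution ratio.
x_star = np.log(d_pi / d_exp)

kl = kl_divergence(d_pi, d_exp)
bound_at_opt = dv_bound(d_pi, d_exp, x_star)
bound_at_zero = dv_bound(d_pi, d_exp, np.zeros(3))

print("exact KL:             ", kl)
print("DV bound at log ratio:", bound_at_opt)
print("DV bound at x = 0:    ", bound_at_zero)
```

In this toy setting both expectations are computed exactly. The difficulty the abstract describes arises when the expectation under the policy's own distribution has to be estimated from samples, which ordinarily requires on-policy data.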
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Human Pose and Action Recognition
