TL;DR
This paper introduces a novel approach combining inverse reinforcement learning and adversarial imitation learning to enable planning from observation-only demonstrations, improving sample efficiency and robustness in control tasks.
Contribution
It unifies IRL and adversarial imitation learning into a planning-based framework that learns from observations, enhancing interpretability and performance.
Findings
Significant improvements in sample efficiency.
Enhanced out-of-distribution generalization.
Robust performance in real-world navigation tasks.
Abstract
Human demonstration data is often ambiguous and incomplete, motivating imitation learning approaches that also exhibit reliable planning behavior. A common paradigm to perform planning-from-demonstration involves learning a reward function via Inverse Reinforcement Learning (IRL) then deploying this reward via Model Predictive Control (MPC). Towards unifying these methods, we derive a replacement of the policy in IRL with a planning-based agent. With connections to Adversarial Imitation Learning, this formulation enables end-to-end interactive learning of planners from observation-only demonstrations. In addition to benefits in interpretability, complexity, and safety, we study and observe significant improvements on sample efficiency, out-of-distribution generalization, and robustness. The study includes evaluations in both simulated control benchmarks and real-world navigation…
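To make the planning side of the abstract concrete, here is a minimal sketch of one sample-based MPC step in the style of Model Predictive Path Integral (MPPI) control, the planner the reviews below say the paper uses. It is not the authors' implementation. The point-mass dynamics, the hand-written distance reward, and every name and default value (`mppi_plan`, `toy_dynamics`, `toy_reward`, `horizon`, `temperature`, and so on) are assumptions made for illustration. In the paper's setting the reward would be learned from demonstrations rather than written by hand. The only property kept here is that the reward depends on state transitions and never on expert actions, matching the observation-only setup.

```python
import numpy as np


def mppi_plan(state, dynamics, reward, horizon=15, num_samples=256,
              action_dim=2, noise_std=0.5, temperature=1.0, rng=None):
    """One MPPI-style planning step (illustrative sketch, not the paper's code).

    Samples Gaussian action sequences around a zero nominal plan, rolls them
    out through `dynamics`, scores them with `reward`, and returns the
    softmax-weighted average sequence with shape (horizon, action_dim).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Candidate action sequences: (num_samples, horizon, action_dim).
    actions = noise_std * rng.standard_normal((num_samples, horizon, action_dim))
    states = np.repeat(state[None, :], num_samples, axis=0)
    returns = np.zeros(num_samples)
    for t in range(horizon):
        next_states = dynamics(states, actions[:, t])
        # The reward sees only (s, s'), never the action that produced s'.
        returns += reward(states, next_states)
        states = next_states
    # Exponentiated returns give the weights; subtract the max for stability.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    return np.einsum("n,nha->ha", weights, actions)


# Hypothetical toy problem: a 2D point mass that should reach `goal`.
goal = np.array([1.0, 1.0])


def toy_dynamics(states, actions):
    # Assumed dynamics: each action displaces the state by 0.1 * action.
    return states + 0.1 * actions


def toy_reward(states, next_states):
    # Hand-written stand-in for a learned reward: negative distance to goal.
    return -np.linalg.norm(next_states - goal, axis=1)


# Closed-loop control: replan at every step and execute only the first action.
rng = np.random.default_rng(0)
state = np.zeros(2)
initial_distance = np.linalg.norm(state - goal)
for _ in range(100):
    plan = mppi_plan(state, toy_dynamics, toy_reward, rng=rng)
    state = toy_dynamics(state[None, :], plan[:1])[0]
final_distance = np.linalg.norm(state - goal)
```

The number of sampled sequences needed to cover the action space grows quickly with `action_dim`, which is the scalability concern one of the reviews below raises about sample-based planners.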
Peer Reviews
Decision: ICLR 2026 Poster
- The idea of unifying the reward learning (IRL) and online planning (MPC) into a single, end-to-end training process is new.
- It provides good experimental evidence that using a planner as the generator improves generalization and robustness in out-of-distribution states compared to standard policy-based AIL methods.
- The paper demonstrates successful real-world application through a "Real-Sim-Real" experiment, where the agent learns to navigate from just a single, noisy, and partially obse
- The method's scalability to high-dimensional action spaces is highly questionable. The paper relies on Model Predictive Path Integral (MPPI), a sample-based planner that suffers from the curse of dimensionality. While it works for low-dimensional actions like vehicle control (2D action), its performance degrades sharply as the action space grows. The paper's own experiment on the Ant environment (Figure 12), which has an 8-dimensional action space, demonstrates this weakness clearly: MPAIL sho
This paper presents a novel planning-based policy optimization for AIL that matches how the learned policy is actually deployed. It presents several theoretical results that connect KL-constrained policy optimization using MPPI with the adversarial imitation learning objective, and it demonstrates the approach through deployment in a real-world environment.
This paper targets the problem of learning from observational demonstrations in a POMDP setting. However, it seems that all the technical developments are based on full state information. The major contribution comes from the use of the model-based planning algorithm MPPI for AIL, but the motivation for using MPPI instead of other methods requires more discussion. Can other model-based planning methods be used? What is the advantage of MPPI, and why are other methods inferior here? The paper is
1. The paper is well-organized and methodically builds from theoretical formulation to algorithm design and experiments, making a complex contribution accessible and reproducible.
2. The approach successfully handles observation-only demonstrations and partial observability, showing real-world viability for robot learning from minimal, ambiguous expert data.
3. The experiments span both simulated and real-world settings, demonstrating strong improvements in generalization, robustness, and sample
1. Most experiments focus on navigation and simple control tasks, so it’s unclear how well MPAIL would perform on more complex, high-dimensional problems like manipulation or multi-agent settings.
2. Although the few-shot setup is intentional, relying on very few demonstrations may make the results sensitive to the choice or quality of those examples, which isn’t extensively analyzed.
3. The paper could better isolate which components (e.g., MPPI planner, value bootstrapping, or reward formulati
