Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement
Michael Bloesch, Markus Wulfmeier, Philemon Brakel, Todor Davchev, Martina Zambelli, Jost Tobias Springenberg, Abbas Abdolmaleki, William F Whitney, Nicolas Heess, Roland Hafner, Martin Riedmiller

TL;DR
This paper advances large-scale imitation learning from observation by developing a method that leverages value functions to learn from nuanced, real-world data distributions, enabling iterative self-improvement.
Contribution
It introduces a novel RL-based imitation learning approach that effectively utilizes action-free demonstrations with complex data distributions for scalable behavior learning.
Findings
The method improves learning from diverse data distributions.
It highlights limitations of existing algorithms in realistic scenarios.
It provides insights for more robust imitation learning techniques.
Abstract
Imitation Learning from Observation (IfO) offers a powerful way to learn behaviors at large-scale: Unlike behavior cloning or offline reinforcement learning, IfO can leverage action-free demonstrations and thus circumvents the need for costly action-labeled demonstrations or reward functions. However, current IfO research focuses on idealized scenarios with mostly bimodal-quality data distributions, restricting the meaningfulness of the results. In contrast, this paper investigates more nuanced distributions and introduces a method to learn from such data, moving closer to a paradigm in which imitation learning can be performed iteratively via self-improvement. Our method adapts RL-based imitation learning to action-free demonstrations, using a value function to transfer information between expert and non-expert data. Through comprehensive evaluation, we delineate the relation between…
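The idea of using a value function to carry information from action-free expert observations to action-labeled non-expert data can be illustrated with a small tabular sketch. This is not the authors' implementation: the 5-state chain environment, the function names `fit_value` and `awr_policy`, the binary reward labeling (1 for expert transitions, 0 for background transitions), and the advantage-weighted policy fit are all assumptions chosen for illustration, loosely following the ingredients the reviews mention (binary rewards, value learning, policy fitting).

```python
import numpy as np

def fit_value(expert_obs, background, n_states, gamma=0.9, iters=500):
    """Fit a state-value function V(s) from state-only transitions.

    Assumed labeling: expert transitions get reward 1, background
    transitions get reward 0. No actions are needed for this step.
    """
    trans = [(s, s2, 1.0) for s, s2 in expert_obs]
    trans += [(s, s2, 0.0) for s, _a, s2 in background]
    s_idx = np.array([t[0] for t in trans])
    s2_idx = np.array([t[1] for t in trans])
    r = np.array([t[2] for t in trans])
    counts = np.maximum(np.bincount(s_idx, minlength=n_states), 1)
    v = np.zeros(n_states)
    for _ in range(iters):
        # Synchronous TD backup: average the targets of all transitions per state.
        targets = r + gamma * v[s2_idx]
        v = np.bincount(s_idx, weights=targets, minlength=n_states) / counts
    return v

def awr_policy(background, v, n_states, n_actions, gamma=0.9, beta=0.5):
    """Advantage-weighted fit of a tabular policy on action-labeled data.

    Assumes every state appears in the background data; otherwise the
    row normalization below would divide by zero.
    """
    weights = np.zeros((n_states, n_actions))
    for s, a, s2 in background:
        adv = gamma * v[s2] - v[s]  # background reward is 0 by assumption
        weights[s, a] += np.exp(adv / beta)
    return weights / weights.sum(axis=1, keepdims=True)

# Hypothetical toy problem: a chain of 5 states, action 0 = left, 1 = right.
n_states, n_actions = 5, 2
# Expert observations contain states only and cover the right end of the chain.
expert_obs = [(2, 3), (3, 4), (4, 4)]
# Background data tries both actions in every state and records the outcome.
background = [
    (s, a, min(max(s + (1 if a == 1 else -1), 0), n_states - 1))
    for s in range(n_states)
    for a in range(n_actions)
]

v = fit_value(expert_obs, background, n_states)
policy = awr_policy(background, v, n_states, n_actions)
print(np.round(v, 2))
print(policy.argmax(axis=1))
```

In this toy setting the value function rises toward the states the expert visited, so the advantage weights favor background actions that move in that direction, even though the expert data contains no actions at all.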
Peer Reviews
Decision: Submitted to ICLR 2025
*Addresses an Important Problem*: The paper tackles the practical challenge of learning from heterogeneous demonstration data, which is valuable in scenarios where expert action labels are unavailable but expert state observations are accessible. *Novel Methodology*: Extends existing methods that combine value function learning and policy fitting, supervised through either binary rewards or discriminator predictions, which is a creative approach to leveraging available data. *Application to Div
- *Insufficient Practical Motivation*: The paper lacks clear examples of real-world scenarios where the specific data setting (expert demonstrations without actions and sub-optimal demonstrations with actions) is prevalent. Providing practical applications, such as the use of shared-embodiment devices like the UMI gripper (https://umi-gripper.github.io/), would strengthen the motivation and highlight the method's relevance to practitioners.
- *Absence of Real-World Experiments*: The lack of real
Generally well-written paper, with an easy-to-follow structure and well laid out motivations and claims. The authors make reasonable claims and specifically state their effort to contribute to the significant and challenging problem of learning from observations in the offline setting, which can be a prerequisite for large-scale learning. Their experiments are reasonably displayed. The authors compare their method against the baselines in both their own dataset for the popular task in the D4RL an
I) Novelty. The authors' contribution lies in showcasing that SQIL-type methods can potentially learn even without considering actions, as long as the demonstration set is reasonably performant, and in providing a new dataset that could be of use to the community. I believe that this would make a wonderful workshop paper, as it further explores the idea that trying to regularize BC that diverges too far from what is demonstrated can lead to better generalization than simple BC. II) N Minor a) T
Significance: Finding edge cases and distribution imbalance in the benchmarks followed in the literature and providing an alternative benchmark. Based on the findings in the paper, the prior benchmarks are biased to be bimodal. Also finding cases where prior work, DILO [1] and SMODICE [2], doesn't perform as well. Originality: It is a mix of ideas from previous work. The algorithm is similar to the one proposed in DILO [1]: learning a value function and using it to learn a policy. Using S
1. The paper is an empirical experiment on using value functions for a mixture of expertly annotated datasets and background datasets. It would be nice to see a rigorous study grounded in theory regarding policy improvement. What are the bounds of improvement on the policy, that is, the maximum performance that can be achieved by the policy? Can the policy do better than the expert demonstrations, and if yes, in what settings? 2. There has been a mention of Advantage Weighted Regression (AWR) [1] in Sectio
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Robot Manipulation and Learning
