Imitation Learning from Observation with Automatic Discount Scheduling
Yuyang Liu, Weijun Dong, Yingdong Hu, Chuan Wen, Zhao-Heng Yin, Chongjie Zhang, Yang Gao

TL;DR
This paper introduces an Automatic Discount Scheduling mechanism for imitation learning from observations, enabling agents to learn sequential behaviors more effectively by adaptively adjusting reward emphasis during training.
Contribution
The paper proposes a novel framework with ADS that improves learning in progress-dependent tasks by dynamically adjusting discount factors, outperforming existing methods.
Findings
Significantly outperforms state-of-the-art methods on nine Meta-World tasks.
Addresses challenges in progress-dependent imitation learning tasks.
Enables learning of earlier behaviors before later ones.
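The idea of scheduling the discount factor so that earlier behaviors are learned before later ones can be illustrated with a minimal sketch. This is not the paper's ADS rule: the function name `scheduled_discount`, the parameters `min_horizon` and `max_horizon`, the progress estimate in [0, 1], and the linear-horizon mapping are all assumptions made for illustration.

```python
def scheduled_discount(progress: float,
                       min_horizon: float = 5.0,
                       max_horizon: float = 100.0) -> float:
    """Map a progress estimate in [0, 1] to a discount factor.

    Hypothetical schedule: the effective horizon 1 / (1 - gamma) grows
    linearly with progress, so a beginner agent uses a small gamma.
    """
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    horizon = min_horizon + progress * (max_horizon - min_horizon)
    return 1.0 - 1.0 / horizon


def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma**t * r_t, accumulated from the last step backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Weight that a reward 50 steps ahead receives at each stage.
    for progress in (0.0, 0.5, 1.0):
        gamma = scheduled_discount(progress)
        print(f"progress={progress:.1f} gamma={gamma:.3f} "
              f"weight@50={gamma ** 50:.5f}")
```

With a small discount factor, rewards assigned to later steps contribute almost nothing to the return, so the agent's objective is dominated by the early part of the trajectory; raising the factor as progress increases brings later steps into play.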
Abstract
Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial…
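To make the notion of a proxy reward computed from the agent's and the expert's observations concrete, here is a minimal sketch. The summary does not specify the reward the paper uses, so the time-aligned Euclidean distance below is a stand-in assumption, and `proxy_rewards` is a hypothetical name.

```python
import math


def proxy_rewards(agent_obs, expert_obs):
    """Return one reward per agent step, using observations only.

    Assumed rule (illustrative, not the paper's): the reward at step t is
    the negative Euclidean distance between the agent's observation and
    the expert's observation at the same step. If the agent trajectory is
    longer, the last expert observation is reused.
    """
    rewards = []
    for t, obs in enumerate(agent_obs):
        target = expert_obs[min(t, len(expert_obs) - 1)]
        rewards.append(-math.dist(obs, target))
    return rewards


if __name__ == "__main__":
    expert = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    agent = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0)]
    print(proxy_rewards(agent, expert))
```

No expert actions appear anywhere in the computation, which is what distinguishes the observation-only setting from standard imitation learning.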
Peer Reviews
Decision·ICLR 2024 poster
+ The presented idea is simple and well motivated.
+ Strong empirical performance compared to selected baselines.
- While the presented idea is simple and interesting, it demands further analysis:
  - If the goal is to learn to follow the earlier parts of trajectories first, and then move forward once the policy has learned them, why not simply put a scheduler on truncating the expert trajectories instead of on the discount factor? Changing the discount factor seems unnatural, especially considering that it is used together with an off-policy RL algorithm. As soon as one changes the discount factor, the target Q value
1. As the paper demonstrates, progress dependency is a critical obstacle to effective ILfO learning. Several persuasive examples in the paper illustrate this point. The proposed solution targets a key cause of the issue and posits a well-designed learning technique, ADS, to avoid it. The demonstration is clear, and the algorithm design is intuitive and reasonable. 2. The experiments are comprehensive and show sufficient performance gains. The ablation study is extensive.
1. I'm quite curious about the motivation of this paper: the introduction makes clear that proxy-reward-based ILfO is susceptible to the progress dependency issue. However, the problem seems similar to a common issue in reinforcement learning, the catastrophic forgetting problem. Also, classic methods like Q-learning already involve a replay buffer to avoid the possibility of being stuck in a local optimum, or the so-called instability problem of
The research stands out in its originality by identifying a previously unaddressed challenge in conventional ILfO algorithms, specifically their limitations in handling tasks with progress dependency. Moreover, the introduction of the Automatic Discount Scheduling (ADS) mechanism within the ILfO framework is a novel contribution, showcasing a creative combination of existing ideas to address a new problem. The quality of the research is evident in its thorough approach to problem-solving. The a
At its heart, the paper's key proposition seems intuitive. Given that the objective is to imitate a sequence of actions, it's somewhat expected that there should be a dependency between actions. The current approach might be seen as a direct response to an oversight in the original problem formulation. Exploring more sophisticated reward designs or distance measurements could potentially offer a more nuanced solution to the challenge.
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI)
