TL;DR
REDS is a novel reward learning framework that uses segmented video demonstrations to train dense reward functions, enabling reinforcement learning agents to perform complex robotic tasks with minimal supervision and better generalization.
Contribution
REDS introduces a subtask-aware reward learning method from segmented videos, improving reward signal quality and generalization in robotic manipulation tasks.
Findings
Outperforms baseline methods on Meta-World tasks
Achieves significant success in furniture assembly in FurnitureBench
Enables generalization to unseen tasks and robot types
Abstract
Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This paper introduces REDS: REward learning from Demonstration with Segmentations, a novel reward learning framework that leverages action-free videos with minimal supervision. Specifically, REDS employs video demonstrations segmented into subtasks from diverse sources and treats these segments as ground-truth rewards. We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals by minimizing the Equivalent-Policy Invariant Comparison distance. Additionally, we employ contrastive learning objectives to align video…
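The abstract's idea of aligning a learned reward with segment-derived ground truth can be illustrated with a small sketch. EPIC compares two reward functions through the Pearson distance between their canonicalized versions. The code below shows only that Pearson-distance step on a single trajectory and omits canonicalization. Two things here are assumptions made for illustration and are not taken from the paper: using the subtask index as the per-frame ground-truth reward, and the helper names `subtask_labels_to_rewards` and `pearson_distance`.

```python
import math


def subtask_labels_to_rewards(segment_lengths):
    """Assumed labeling scheme: every frame inside subtask i gets reward i."""
    rewards = []
    for subtask_index, length in enumerate(segment_lengths):
        rewards.extend([float(subtask_index)] * length)
    return rewards


def pearson_distance(predicted, target):
    """Pearson distance sqrt((1 - rho) / 2) between two reward sequences.

    It is 0 when the sequences are related by a positive affine map
    and 1 when they are perfectly anti-correlated.
    """
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("Pearson correlation is undefined for constant input")
    rho = cov / math.sqrt(var_p * var_t)
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point drift
    return math.sqrt((1.0 - rho) / 2.0)


# A video split into three subtasks lasting 2, 3, and 1 frames.
ground_truth = subtask_labels_to_rewards([2, 3, 1])
# A prediction that differs only by scale and offset is at distance ~0.
rescaled = [2.0 * r + 5.0 for r in ground_truth]
distance = pearson_distance(rescaled, ground_truth)
```

Because the distance ignores positive rescaling and constant offsets, a reward model trained against it would only need to match the shape of the target signal, not its absolute values.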
Peer Reviews
Decision · ICLR 2025 Poster
Overall, the paper is well written, and the problem is very well explained. The authors have clearly delineated their contributions from previous works. The experiments and ablations are detailed and informative. With respect to the proposed approach, the strengths are: The approach requires minimal human supervision in terms of defining the subtasks accurately. The proposed approach generalizes well for manipulation tasks with unseen objects. Additionally, since the reward model does not depend
The expert demonstrations would always contain the subtasks in a particular order. This might lead to poor reward signals when the subtask estimation turns out to be incorrect. Such instances could occur when the RL agent is exploring. The effect of the hyperparameter epsilon, which is used to enforce progressive reward signals within each subtask, is not clearly explained. The authors show an ablation for the cases with and without regularization loss. But, the effect of epsilon is not clearly
1. This paper is written clearly and highlights an important challenge for long-horizon reinforcement learning.
2. This paper provides a reward model training method that utilizes both expert demonstration videos and suboptimal videos.
- The method seems to rely heavily on carefully predefined subtasks or key completion points (as in Table 6), which may limit its generalizability.
i) In this paper, subtask information is integrated into the reward learning process. During training, the subtask is included in the reward function input as a text embedding that provides instructions on completing a specific subtask. In the inference phase, the text embedding is substituted with a video embedding as an additional input.
ii) The approach employs the EPIC loss function to reduce the disparity between the predicted reward sequence and the ground truth reward. Experimental resul
i) In the training phase, this paper decomposes the overall task into multiple subtasks based on domain knowledge. However, the reliance on predefined instructions from the environment for task decomposition raises concerns about practical applicability. Some environments may lack such predefined knowledge, necessitating human annotations or the need for a learned task decomposition model when extending this approach to new environments. This raises doubts about the novelty and scalability of th
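One of the reviews above asks how the hyperparameter epsilon, used to enforce progressive reward signals within each subtask, affects training. As a rough illustration of what such a term could look like, the sketch below penalizes consecutive steps in the same subtask whose predicted reward rises by less than epsilon. This hinge form, the function name `progressive_penalty`, and the averaging over pairs are assumptions made for illustration. The paper's actual regularizer may differ.

```python
def progressive_penalty(rewards, subtask_ids, epsilon):
    """Hypothetical hinge penalty: within a subtask, each step's reward
    should exceed the previous step's reward by at least epsilon."""
    if len(rewards) != len(subtask_ids):
        raise ValueError("rewards and subtask_ids must have equal length")
    total = 0.0
    pairs = 0
    for t in range(len(rewards) - 1):
        if subtask_ids[t] != subtask_ids[t + 1]:
            continue  # skip pairs that straddle a subtask boundary
        total += max(0.0, epsilon - (rewards[t + 1] - rewards[t]))
        pairs += 1
    return total / pairs if pairs else 0.0


# Flat rewards inside a subtask are penalized by epsilon per step pair.
flat = progressive_penalty([1.0, 1.0, 1.0], [0, 0, 0], epsilon=0.1)
# Rewards that rise by at least epsilon incur no penalty.
rising = progressive_penalty([0.0, 0.5, 1.0, 1.0, 1.5], [0, 0, 0, 1, 1], epsilon=0.1)
```

Under this form, a larger epsilon demands steeper reward growth inside each subtask, while epsilon equal to zero only forbids decreases.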
Taxonomy
Methods: Contrastive Learning · ALIGN
