Let Me Show You: Learning by Retrieving from Egocentric Video for Robotic Manipulation
Yichen Zhu, Feifei Feng

TL;DR
This paper introduces a novel robot learning method called Retrieving-from-Video (RfV), which leverages human demonstration videos and mid-level information to improve robotic manipulation and generalization in complex environments.
Contribution
The paper presents a new approach that uses a video retrieval system and mid-level features from human demonstrations to enhance robot policy learning and generalization.
Findings
Significant performance improvements over traditional methods.
Effective generalization to unseen tasks.
Successful deployment in real-world robotic scenarios.
Abstract
Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans are faced with unfamiliar tasks, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serve as additional inputs to enhance the robot model's learning and generalization capabilities. We further feature a dual-component system: a video retriever…
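To make the video-bank idea concrete, the following is a minimal sketch of how a retriever might pick the closest human demonstration for a given task. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say how retrieval is performed, so cosine similarity over precomputed embeddings is assumed here, and `VideoEntry`, `cosine_similarity`, `retrieve`, the toy embeddings, and the video names are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class VideoEntry:
    """One item in a hypothetical video bank."""
    name: str
    embedding: Sequence[float]  # assumed precomputed feature vector for the clip


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: Sequence[float], bank: List[VideoEntry], k: int = 1) -> List[VideoEntry]:
    """Return the k bank entries most similar to the query embedding."""
    ranked = sorted(
        bank,
        key=lambda entry: cosine_similarity(query, entry.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy bank with made-up 3-dimensional embeddings (assumption for illustration).
bank = [
    VideoEntry("open_drawer", [1.0, 0.0, 0.0]),
    VideoEntry("pour_water", [0.0, 1.0, 0.0]),
    VideoEntry("wipe_table", [0.0, 0.0, 1.0]),
]

# A query embedding that lies closest to the "open_drawer" clip.
query = [0.9, 0.1, 0.0]
top = retrieve(query, bank, k=1)
print(top[0].name)
```

In a system like the one the abstract describes, the retrieved clip would then supply the mid-level cues (object affordance masks and hand motion trajectories) that are passed to the robot policy as additional inputs; that step is outside the scope of this sketch.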
Taxonomy
Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications
