Learning Video-Conditioned Policies for Unseen Manipulation Tasks
Elliot Chane-Sane, Cordelia Schmid, Ivan Laptev

TL;DR
This paper introduces ViP, a video-conditioned policy learning method enabling robots to perform unseen manipulation tasks from human demonstration videos in a zero-shot setting, without task-specific training.
Contribution
The paper presents a novel zero-shot learning approach that maps human demonstration videos to robot actions using pre-trained video embeddings, avoiding task-specific training data.
Findings
Outperforms the state of the art in multi-task manipulation environments
Enables zero-shot robot control from human videos
Effective generalization to unseen tasks
Abstract
The ability to specify robot commands by a non-expert user is critical for building generalist agents capable of solving a large variety of tasks. One convenient way to specify the intended robot goal is by a video of a person demonstrating the target task. While prior work typically aims to imitate human demonstrations performed in robot environments, here we focus on a more realistic and challenging setup with demonstrations recorded in natural and diverse human environments. We propose Video-conditioned Policy learning (ViP), a data-driven approach that maps human demonstrations of previously unseen tasks to robot manipulation skills. To this end, we learn our policy to generate appropriate actions given current scene observations and a video of the target task. To encourage generalization to new tasks, we avoid particular tasks during training and learn our policy from unlabelled…
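The abstract describes a policy that produces actions from the current scene observation together with a video of the target task. The sketch below only illustrates that input/output interface and is not the authors' implementation. The mean-pooling encoder, the linear policy, and every name and dimension (`embed_video`, `VideoConditionedPolicy`, `obs_dim`, `emb_dim`, `action_dim`) are hypothetical stand-ins chosen for brevity.

```python
import random
from typing import List, Sequence


def embed_video(frames: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in for a pre-trained video encoder.

    Mean-pools per-frame feature vectors into one fixed-size embedding.
    """
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


class VideoConditionedPolicy:
    """Toy linear policy: action = W @ concat(observation, video_embedding).

    Assumption: a real policy would be a trained neural network; the random
    weights here only demonstrate how the two inputs are combined.
    """

    def __init__(self, obs_dim: int, emb_dim: int, action_dim: int, seed: int = 0):
        rng = random.Random(seed)
        self.in_dim = obs_dim + emb_dim
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(self.in_dim)]
            for _ in range(action_dim)
        ]

    def act(self, observation: Sequence[float], video_embedding: Sequence[float]) -> List[float]:
        # Condition on the task by concatenating the video embedding to the observation.
        x = list(observation) + list(video_embedding)
        if len(x) != self.in_dim:
            raise ValueError("observation and embedding sizes do not match the policy")
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


# Two frames of 3-dimensional features standing in for a demonstration video.
frames = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
task_embedding = embed_video(frames)

policy = VideoConditionedPolicy(obs_dim=4, emb_dim=3, action_dim=2)
observation = [0.1, 0.2, 0.3, 0.4]
action = policy.act(observation, task_embedding)
```

Because the task enters only through the embedding, the same policy can be queried with the embedding of a different demonstration video without retraining, which is the property the zero-shot setting relies on.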
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Test
