Zero-Shot Imitating Collaborative Manipulation Plans from YouTube Cooking Videos
Hejia Zhang, Jie Zhong, Stefanos Nikolaidis

TL;DR
This paper presents a system that learns to interpret and execute collaborative manipulation actions from YouTube cooking videos, enabling robots to perform complex tasks by understanding human demonstrations in real-world, multi-person scenarios.
Contribution
The paper builds on the linguistic, hierarchical structure of human manipulation actions, which relates each action to the objects and tools it involves, to understand and replicate collaborative manipulation plans from unstructured videos in a form that transfers to robotic systems.
Findings
Higher action detection accuracy than baseline methods
Effective execution of learned plans in simulation and real robots
Successful interpretation of multi-person collaborative actions
Abstract
People often watch videos on the web to learn how to cook new recipes, assemble furniture or repair a computer. We wish to enable robots with the very same capability. This is challenging; there is a large variation in manipulation actions and some videos even involve multiple persons, who collaborate by sharing and exchanging objects and tools. Furthermore, the learned representations need to be general enough to be transferable to robotic systems. On the other hand, previous work has shown that the space of human manipulation actions has a linguistic, hierarchical structure that relates actions to manipulated objects and tools. Building upon this theory of language for action, we propose a system for understanding and executing demonstrated action sequences from full-length, real-world cooking videos on the web. The system takes as input a new, previously unseen cooking video…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning
Methods: Repair
