Structured World Models from Human Videos
Russell Mendonca, Shikhar Bahl, Deepak Pathak

TL;DR
This paper introduces a method for robots to learn manipulation skills rapidly by leveraging structured, human-centric action spaces and world models trained on internet-scale human videos, enabling effective learning from minimal real-world interaction.
Contribution
The paper presents a novel approach that combines visual affordances from human videos with world models, allowing robots to learn complex manipulation skills efficiently with limited real-world data.
Findings
Robots can learn a variety of manipulation skills in under 30 minutes of real-world interaction.
The approach effectively transfers knowledge from human videos to robot manipulation.
Fine-tuning the world model on a small amount of real-world data is enough to acquire skills without task supervision.
Abstract
We tackle the problem of learning complex, general behaviors directly in the real world. We propose an approach for robots to efficiently learn manipulation skills using only a handful of real-world interaction trajectories from many different settings. Inspired by the success of learning from large-scale datasets in the fields of computer vision and natural language, our belief is that in order to efficiently learn, a robot must be able to leverage internet-scale, human video data. Humans interact with the world in many interesting ways, which can allow a robot to not only build an understanding of useful actions and affordances but also how these actions affect the world for manipulation. Our approach builds a structured, human-centric action space grounded in visual affordances learned from human videos. Further, we train a world model on human videos and fine-tune on a small amount…
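To make the idea of planning with a world model over a structured action space concrete, here is a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the `AffordanceAction` type (a grasp pixel plus a post-grasp target pixel), the hand-written `toy_world_model` standing in for a learned model, the squared-distance `score`, and the random-shooting planner in `plan` are all hypothetical names and choices introduced for illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple

Pixel = Tuple[int, int]


@dataclass(frozen=True)
class AffordanceAction:
    """Hypothetical structured action: where to grasp, then where to move."""
    grasp_px: Pixel
    post_grasp_px: Pixel


def sample_action(rng: random.Random, width: int, height: int) -> AffordanceAction:
    """Sample a random action in image space (assumed uniform proposal)."""
    def pixel() -> Pixel:
        return (rng.randrange(width), rng.randrange(height))
    return AffordanceAction(pixel(), pixel())


def toy_world_model(obj_px: Pixel, action: AffordanceAction,
                    grasp_radius: int = 5) -> Pixel:
    """Toy stand-in for a learned world model (assumption, not from the paper).

    Predicts the object's next position: it moves to the post-grasp target
    only if the grasp lands within grasp_radius pixels of the object.
    """
    dx = action.grasp_px[0] - obj_px[0]
    dy = action.grasp_px[1] - obj_px[1]
    if dx * dx + dy * dy <= grasp_radius ** 2:
        return action.post_grasp_px
    return obj_px


def score(obj_px: Pixel, goal_px: Pixel) -> float:
    """Negative squared pixel distance to the goal (higher is better)."""
    return -float((obj_px[0] - goal_px[0]) ** 2 + (obj_px[1] - goal_px[1]) ** 2)


def plan(obj_px: Pixel, goal_px: Pixel, rng: random.Random,
         n_candidates: int = 256, width: int = 64, height: int = 64):
    """Random shooting: imagine each candidate's outcome with the model
    and keep the action whose predicted outcome scores best."""
    best_action, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = sample_action(rng, width, height)
        predicted = toy_world_model(obj_px, candidate)
        s = score(predicted, goal_px)
        if s > best_score:
            best_action, best_score = candidate, s
    return best_action, best_score


if __name__ == "__main__":
    action, value = plan((10, 10), (40, 40), random.Random(0))
    print(action, value)
```

The point of the sketch is the division of labor: a compact action space keeps the search small, and the model lets candidate actions be evaluated in imagination rather than on the robot.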
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
