Object-centric 3D Motion Field for Robot Learning from Human Videos
Zhao-Heng Yin, Sherry Yang, Pieter Abbeel

TL;DR
This paper introduces an object-centric 3D motion field representation for robot learning from human videos, enabling zero-shot control and improving motion estimation accuracy for diverse manipulation tasks.
Contribution
It presents a novel framework with a denoising 3D motion estimator and a dense prediction architecture, enhancing transferability and generalization in robot learning from videos.
Findings
Reduces 3D motion estimation error by over 50%.
Achieves a 55% success rate across diverse tasks.
Enables fine-grained manipulation skills like insertion.
Abstract
Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixel flow, and point cloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use an object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for training a "denoising" 3D motion field estimator to robustly extract fine object 3D motions from human videos with noisy depth. Second, a dense object-centric 3D motion field prediction…
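To make the representation concrete, here is a minimal sketch of what an object-centric 3D motion field contains: one 3D displacement vector per object pixel. It is not the paper's estimator. It is a naive geometric baseline that assumes a pinhole camera, per-frame depth maps, a given pixel correspondence (`flow`), and an object mask; all function and parameter names are hypothetical. Because it differences raw depth values directly, any depth noise passes straight into the motion vectors, which is the problem the abstract's "denoising" estimator is meant to address.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, z: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at depth z into the camera frame."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def naive_motion_field(depth_t: List[List[float]],
                       depth_t1: List[List[float]],
                       flow: List[List[Tuple[float, float]]],
                       mask: List[List[bool]],
                       fx: float, fy: float, cx: float, cy: float
                       ) -> List[List[Optional[Vec3]]]:
    """Return a per-pixel 3D displacement for object pixels, None elsewhere.

    flow[v][u] = (du, dv) is an assumed pixel correspondence from frame t
    to frame t+1. Images are indexed [row][col], i.e. [v][u].
    """
    h, w = len(depth_t), len(depth_t[0])
    field: List[List[Optional[Vec3]]] = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            if not mask[v][u]:
                continue  # object-centric: background pixels carry no motion
            du, dv = flow[v][u]
            u1, v1 = round(u + du), round(v + dv)
            if not (0 <= u1 < w and 0 <= v1 < h):
                continue  # correspondence left the image
            p0 = backproject(u, v, depth_t[v][u], fx, fy, cx, cy)
            p1 = backproject(u1, v1, depth_t1[v1][u1], fx, fy, cx, cy)
            field[v][u] = (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])
    return field
```

The output has the same spatial layout as the image, so it can be treated as a dense image-like prediction target, in the spirit of the dense prediction architecture the summary mentions.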
Taxonomy
Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
