Object-centric 3D Motion Field for Robot Learning from Human Videos

Zhao-Heng Yin; Sherry Yang; Pieter Abbeel

arXiv:2506.04227·cs.RO·June 5, 2025

Open Access

TL;DR

This paper introduces an object-centric 3D motion field representation for robot learning from human videos, enabling zero-shot control and improving motion estimation accuracy for diverse manipulation tasks.

Contribution

It presents a novel framework with a denoising 3D motion estimator and a dense prediction architecture, enhancing transferability and generalization in robot learning from videos.

Findings

01. Reduces 3D motion estimation error by over 50%.

02. Achieves 55% success rate in diverse tasks.

03. Enables fine-grained manipulation skills like insertion.

Abstract

Learning robot control policies from human videos is a promising direction for scaling up robot learning. However, how to extract action knowledge (or action representations) from videos for policy learning remains a key challenge. Existing action representations such as video frames, pixel flow, and point cloud flow have inherent limitations such as modeling complexity or loss of information. In this paper, we propose to use an object-centric 3D motion field to represent actions for robot learning from human videos, and present a novel framework for extracting this representation from videos for zero-shot control. We introduce two novel components in its implementation. First, a novel training pipeline for a "denoising" 3D motion field estimator that robustly extracts fine object 3D motions from human videos with noisy depth. Second, a dense object-centric 3D motion field prediction…
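
To make the representation named in the abstract more concrete, here is a minimal, naive sketch. It is not the paper's method: the paper trains a denoising estimator, whereas this sketch simply back-projects object pixels at two frames with a pinhole camera model and subtracts the resulting 3D points. The function names (`backproject`, `naive_motion_field`), the intrinsics parameters (`fx`, `fy`, `cx`, `cy`), and the nested-list layout for depth, pixel flow, and object mask are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, z: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection of pixel (u, v) at depth z into camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def naive_motion_field(
    depth_t: List[List[float]],                 # H x W depth at frame t
    depth_t1: List[List[float]],                # H x W depth at frame t+1
    flow: List[List[Tuple[float, float]]],      # H x W pixel flow (du, dv), t -> t+1
    mask: List[List[bool]],                     # H x W object mask at frame t
    fx: float, fy: float, cx: float, cy: float,
) -> List[List[Optional[Vec3]]]:
    """Hypothetical helper: per-pixel 3D displacement on object pixels, None elsewhere."""
    h, w = len(depth_t), len(depth_t[0])
    field: List[List[Optional[Vec3]]] = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            if not mask[v][u]:
                continue  # object-centric: ignore background pixels
            du, dv = flow[v][u]
            u1, v1 = u + du, v + dv
            # Sample next-frame depth at the nearest pixel to the flow target.
            ui, vi = int(round(u1)), int(round(v1))
            if not (0 <= ui < w and 0 <= vi < h):
                continue  # the point moved out of view
            p0 = backproject(u, v, depth_t[v][u], fx, fy, cx, cy)
            p1 = backproject(u1, v1, depth_t1[vi][ui], fx, fy, cx, cy)
            field[v][u] = (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])
    return field
```

Because this direct subtraction passes any depth error straight into the motion estimate, it illustrates why extracting fine object motion from noisy depth is hard, which is presumably the problem the paper's denoising estimator targets.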

Taxonomy

Topics: Human Pose and Action Recognition · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis