Vision-based Manipulation from Single Human Video with Open-World Object Graphs

Yifeng Zhu; Arisrei Lim; Peter Stone; Yuke Zhu

arXiv:2405.20321 · cs.RO · September 5, 2025 · 1 citation

Open Access

TL;DR

This paper introduces ORION, an object-centric algorithm enabling robots to learn manipulation skills from a single human video demonstration, generalizing across diverse environments and objects.

Contribution

The work presents a novel approach that extracts object-centric manipulation plans from a single human video and conditions robot policies on these plans, advancing open-world robot imitation learning.
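
To make "conditioning a policy on an extracted plan" concrete, here is a toy sketch. It is hypothetical and not ORION's implementation. It assumes each keyframe of the demonstration is reduced to a set of pairwise object contacts plus the object the hand moves next. The names `ObjectGraph`, `contact`, `match_step`, and `next_subgoal` are invented for illustration, and a simple Jaccard-similarity lookup stands in for whatever matching a real system uses.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical, simplified plan format: NOT the data structure used by ORION.
Contact = tuple[str, str]


def contact(a: str, b: str) -> Contact:
    """Undirected contact edge, stored as a sorted pair so contact(a, b) == contact(b, a)."""
    return (a, b) if a <= b else (b, a)


@dataclass(frozen=True)
class ObjectGraph:
    """One keyframe of a toy object-centric plan."""
    contacts: frozenset[Contact]  # pairwise contact relations between objects
    moving_object: str | None     # object the hand moves toward the next keyframe


def match_step(plan: list[ObjectGraph], observed: frozenset[Contact]) -> int:
    """Index of the keyframe whose contacts best match the observed scene (Jaccard similarity)."""
    def jaccard(a: frozenset[Contact], b: frozenset[Contact]) -> float:
        union = a | b
        return len(a & b) / len(union) if union else 1.0
    return max(range(len(plan)), key=lambda i: jaccard(plan[i].contacts, observed))


def next_subgoal(plan: list[ObjectGraph], observed: frozenset[Contact]):
    """Return (object to move, contacts to create) for the next step, or None when the plan is done."""
    i = match_step(plan, observed)
    if i + 1 >= len(plan):
        return None
    return plan[i].moving_object, plan[i + 1].contacts - plan[i].contacts


# Toy demonstration: put the cup in the bowl, then put the bowl on the tray.
EXAMPLE_PLAN = [
    ObjectGraph(frozenset({contact("cup", "table"), contact("bowl", "table"), contact("tray", "table")}), "cup"),
    ObjectGraph(frozenset({contact("cup", "bowl"), contact("bowl", "table"), contact("tray", "table")}), "bowl"),
    ObjectGraph(frozenset({contact("cup", "bowl"), contact("bowl", "tray"), contact("tray", "table")}), None),
]

print(next_subgoal(EXAMPLE_PLAN, EXAMPLE_PLAN[0].contacts))
# -> ('cup', frozenset({('bowl', 'cup')}))
```

Because the lookup compares relations between objects rather than pixels, an extra distractor object in the scene (say, a mug resting on the table) lowers the similarity scores slightly but still selects the correct plan step.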

Findings

01. Achieved an average success rate of 74.4% across tasks.
02. Generalizes to new object instances and environments.
03. Learns from both RGB-D and RGB-only demonstration videos.

Abstract

This work presents an object-centric approach to learning vision-based manipulation skills from human videos. We investigate the problem of robot manipulation via imitation in the open-world setting, where a robot learns to manipulate novel objects from a single video demonstration. We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB or RGB-D video and deriving a policy that conditions on the extracted plan. Our method enables the robot to learn from videos captured by daily mobile devices and to generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances. We systematically evaluate our method on both short-horizon and long-horizon tasks, using RGB-D and RGB-only demonstration videos. Across varied tasks and demonstration types…
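
The abstract notes that plans can be extracted from RGB-D as well as RGB-only video. One standard way to turn RGB-D frames into 3D object motion is pinhole back-projection of tracked pixels. The sketch below shows only that generic step and is not taken from ORION. It assumes known camera intrinsics (the values here are made up), metric depth aligned with the RGB frame, and 2D point tracks that are already available. `backproject` and `mean_displacement` are hypothetical helper names. For RGB-only video, depth would have to come from another source, which this sketch does not cover.

```python
import numpy as np


def backproject(uv, depth, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift N pixels (u, v) with metric depth z to an N x 3 array of camera-frame points.

    Pinhole model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z.
    """
    uv = np.asarray(uv, dtype=float)
    z = np.asarray(depth, dtype=float)
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def mean_displacement(points_start: np.ndarray, points_end: np.ndarray) -> np.ndarray:
    """Average 3D translation of the same tracked points between two keyframes."""
    return (points_end - points_start).mean(axis=0)


# Hypothetical intrinsics for a 640 x 480 camera.
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Two points tracked on one object: pixel positions and depth (meters) at two keyframes.
uv_start = np.array([[320.0, 240.0], [570.0, 240.0]])
uv_end = np.array([[420.0, 240.0], [670.0, 240.0]])
start = backproject(uv_start, [1.0, 2.0], **K)
end = backproject(uv_end, [1.0, 2.0], **K)
print(mean_displacement(start, end))
```

Averaging point displacements recovers only a translation. Estimating the full rigid motion of an object, including rotation, would need a registration step such as the SVD-based Kabsch method applied to the corresponding 3D points.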

Taxonomy

Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Neural Network Applications