H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos
Hai Ci, Xiaokang Liu, Pei Yang, Yiren Song, Mike Zheng Shou

TL;DR
H2R-Grounder translates human interaction videos into realistic robot videos without paired data by using a transferable representation and a fine-tuned diffusion model, enabling scalable robot learning from unlabeled videos.
Contribution
It introduces a paired-data-free framework that converts human videos into robot videos: inpainting and an overlaid visual cue provide a shared representation, and a state-of-the-art diffusion model is fine-tuned to render the robot arm from it.
Findings
Produces more realistic, physically grounded robot videos than baselines
Does not require paired human-robot training data
Scales robot learning from unlabeled human videos
Abstract
Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the…
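The overlay step of the representation described in the abstract can be illustrated with a minimal sketch. It assumes the arm has already been inpainted out (the `background` frame) and that the gripper pose is given as a 2D pixel position plus an in-plane angle; the function name `overlay_gripper_cue`, its parameters, the colors, and the 2D simplification are all hypothetical and not taken from the paper.

```python
import numpy as np

def overlay_gripper_cue(background, center_xy, angle_rad, radius=4, arrow_len=20,
                        marker_color=(255, 0, 0), arrow_color=(0, 255, 0)):
    """Draw a disc marker and a straight arrow shaft on a copy of an RGB frame.

    Hypothetical helper: the paper's actual cue rendering is not specified here.
    background : (H, W, 3) uint8 frame with the arm already inpainted away
    center_xy  : (x, y) pixel position of the gripper
    angle_rad  : in-plane gripper orientation, in radians
    """
    frame = background.copy()  # leave the input frame untouched
    h, w = frame.shape[:2]
    cx, cy = center_xy

    # Arrow shaft: sample points along the orientation direction.
    ts = np.linspace(0.0, arrow_len, num=2 * arrow_len + 1)
    xs = np.rint(cx + ts * np.cos(angle_rad)).astype(int)
    ys = np.rint(cy + ts * np.sin(angle_rad)).astype(int)
    inside = (xs >= 0) & (xs < w) & (ys >= 0) & (ys < h)
    frame[ys[inside], xs[inside]] = arrow_color

    # Marker: filled disc centered on the gripper position (drawn last, on top).
    yy, xx = np.ogrid[:h, :w]
    disc = (xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2
    frame[disc] = marker_color
    return frame

# Example: a blank 64x64 "inpainted" frame with the cue pointing along +x.
background = np.zeros((64, 64, 3), dtype=np.uint8)
cued = overlay_gripper_cue(background, center_xy=(32, 32), angle_rad=0.0)
```

Because the same cue can be drawn on an inpainted robot frame or an inpainted human frame, a generator conditioned on such frames never needs to see paired human-robot data, which is the point the abstract makes.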
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Social Robot Interaction and HRI
