H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos

Hai Ci; Xiaokang Liu; Pei Yang; Yiren Song; Mike Zheng Shou

arXiv:2512.09406 · cs.RO · December 11, 2025

Open Access

TL;DR

H2R-Grounder translates human interaction videos into realistic robot videos without paired data by using a transferable representation and a fine-tuned diffusion model, enabling scalable robot learning from unlabeled videos.

Contribution

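The paper introduces a paired-data-free framework for converting human videos into robot videos. Inpainting and a simple visual-cue overlay produce a representation shared by human and robot footage, and a diffusion model fine-tuned on unpaired robot videos generates the robot arm from that representation.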

Findings

01. Produces more realistic, physically grounded robot videos than baselines

02. Does not require paired human-robot training data

03. Scales robot learning from unlabeled human videos

Abstract

Robots that learn manipulation skills from everyday human videos could acquire broad capabilities without tedious robot data collection. We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos with realistic, physically grounded interactions. Our approach does not require any paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. We introduce a transferable representation that bridges the embodiment gap: by inpainting the robot arm in training videos to obtain a clean background and overlaying a simple visual cue (a marker and arrow indicating the gripper's position and orientation), we can condition a generative model to insert the robot arm back into the scene. At test time, we apply the same process to human videos (inpainting the…
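
The overlay step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `overlay_pose_cue`, its parameters, the colors, and the list-of-pixels frame format are all assumptions made for illustration, and the inpainting and generation stages are not shown. It only draws a marker at an assumed 2D gripper position and a line along an assumed in-plane orientation on top of an already-inpainted background frame.

```python
import math


def overlay_pose_cue(frame, cx, cy, theta, radius=2, length=10,
                     marker_color=(255, 0, 0), arrow_color=(0, 255, 0)):
    """Return a copy of `frame` with a disc at (cx, cy) and a line along `theta`.

    Hypothetical helper: `frame` is a list of rows, each a list of RGB tuples.
    `theta` is in radians, measured in image coordinates (y grows downward).
    """
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy rows so the input stays unchanged

    # Arrow shaft: sample integer pixels along the direction (cos, sin).
    for step in range(length + 1):
        x = int(round(cx + step * math.cos(theta)))
        y = int(round(cy + step * math.sin(theta)))
        if 0 <= x < w and 0 <= y < h:
            out[y][x] = arrow_color

    # Marker: filled disc, drawn last so it sits on top of the shaft.
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                out[y][x] = marker_color
    return out


# Usage: a blank 32x32 stand-in for an inpainted background, cue pointing right.
background = [[(0, 0, 0)] * 32 for _ in range(32)]
cued = overlay_pose_cue(background, cx=16, cy=16, theta=0.0)
```

Because the same cue can be drawn on robot frames and on human frames once the arm or hand has been inpainted away, a model conditioned on it never has to see which embodiment the pose came from.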

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Social Robot Interaction and HRI