Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Sicheng Xie; Haidong Cao; Zejia Weng; Zhen Xing; Haoran Chen; Shiwei Shen; Jiaqi Leng; Zuxuan Wu; Yu-Gang Jiang

arXiv:2502.16587 · cs.RO · November 18, 2025

TL;DR

This paper introduces Human2Robot, a framework that learns fine-grained robot actions from paired human-robot videos. It combines a new dataset, H&R, with a video prediction approach to improve manipulation and generalization.

Contribution

The paper presents a new dataset, H&R, and a framework that uses video prediction to improve robot learning from human demonstrations, especially for complex and novel tasks.

Findings

01 High performance on seen tasks

02 Significant one-shot generalization to new scenarios

03 Effective learning of fine-grained robot dynamics

Abstract

Distilling knowledge from human demonstrations is a promising way for robots to learn and act. Existing methods, which often rely on coarsely-aligned video pairs, are typically constrained to learning global or task-level features. As a result, they tend to neglect the fine-grained frame-level dynamics required for complex manipulation and generalization to novel tasks. We posit that this limitation stems from a vicious circle of inadequate datasets and the methods they inspire. To break this cycle, we propose a paradigm shift that treats fine-grained human-robot alignment as a conditional video generation problem. To this end, we first introduce H&R, a novel third-person dataset containing 2,600 episodes of precisely synchronized human and robot motions, collected using a VR teleoperation system. We then present Human2Robot, a framework designed to leverage this data. Human2Robot…

Code & Models

Datasets

dannyXSC/HumanAndRobot
dataset · 1.6k dl

Videos

Human2Robot: Learning Robot Actions from Paired Human-Robot Videos

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Reinforcement Learning in Robotics