H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation
Hongzhe Bi, Lingxuan Wu, Tianwei Lin, Hengkai Tan, Zhizhong Su, Hang Su, Jun Zhu

TL;DR
H-RDT leverages large-scale human manipulation videos with 3D hand poses to improve robotic manipulation, using a two-stage training process that improves performance in single-task, multitask, few-shot, and robustness settings.
Contribution
The paper introduces H-RDT, a diffusion transformer model trained on human manipulation data to boost robotic manipulation capabilities, addressing data scarcity and embodiment diversity challenges.
Findings
H-RDT outperforms training from scratch and state-of-the-art methods.
Achieves improvements of 13.9% in simulation and 40.5% in real-world tasks, respectively, over training from scratch.
Effective in single-task, multitask, few-shot, and robustness scenarios.
Abstract
Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face significant limitations because the diverse morphologies and action spaces across different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric…
Taxonomy
Topics: Robot Manipulation and Learning
