Learning from Massive Human Videos for Universal Humanoid Pose Control
Jiageng Mao, Siheng Zhao, Siqi Song, Tianheng Shi, Junjie Ye, Mingtong Zhang, Haoran Geng, Jitendra Malik, Vitor Guizilini, Yue Wang

TL;DR
This paper presents Humanoid-X, a dataset of over 20 million humanoid robot poses paired with text-based motion descriptions, built from human videos mined from the Internet to improve the generalization of humanoid robot control.
Contribution
Introduces the Humanoid-X dataset and the UH-1 model, which leverage Internet-sourced human videos for scalable, generalizable humanoid robot control from text instructions.
Findings
UH-1 outperforms existing models in generalization tasks
Humanoid-X enables effective real-world deployment
Scalable training improves adaptability of humanoid robots
Abstract
Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces Humanoid-X, a large-scale dataset of over 20 million humanoid robot poses with corresponding text-based motion descriptions, designed to leverage this abundant data. Humanoid-X is curated through a comprehensive pipeline: data mining from the Internet, video caption generation, motion retargeting of humans to humanoid robots, and policy learning for…
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Robot Manipulation and Learning
