X-Humanoid: Robotize Human Videos to Generate Humanoid Videos at Scale

Pei Yang; Hai Ci; Yiren Song; Mike Zheng Shou

arXiv:2512.04537 · cs.CV · December 5, 2025

TL;DR

X-Humanoid is a novel generative video editing method that transforms human videos into humanoid robot videos, enabling large-scale dataset creation for embodied AI with high motion and embodiment quality.

Contribution

The paper introduces a scalable pipeline and a fine-tuned model for human-to-humanoid video translation, addressing limitations of previous overlay methods.

Findings

01  Generated over 3.6 million humanoid video frames from 60 hours of data.

02  Achieved 69% user preference for motion consistency.

03  Achieved 62.1% user preference for embodiment correctness.
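
As a rough sanity check on the first finding, the two reported figures can be turned into an implied average frame rate. The sketch below is only arithmetic on the numbers quoted above; it assumes that all 60 hours of source video were converted and treats "over 3.6 million" as a lower bound, so the result is a minimum rather than a frame rate stated by the paper. The function name `implied_fps` is hypothetical.

```python
def implied_fps(total_frames: int, hours: float) -> float:
    """Average frames per second implied by a frame count and a duration in hours."""
    seconds = hours * 3600  # 60 hours -> 216,000 seconds
    return total_frames / seconds


# Figures quoted in the findings: at least 3.6 million frames from 60 hours.
fps = implied_fps(3_600_000, 60)
print(f"{fps:.2f} frames per second")  # prints "16.67 frames per second"
```

Under these assumptions the dataset averages at least about 16.7 frames for every second of source video.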

Abstract

The advancement of embodied AI has unlocked significant potential for intelligent humanoid robots. However, progress in both Vision-Language-Action (VLA) models and world models is severely hampered by the scarcity of large-scale, diverse training data. A promising solution is to "robotize" web-scale human videos, which has been proven effective for policy training. However, these solutions mainly "overlay" robot arms onto egocentric videos, which cannot handle complex full-body motions and scene occlusions in third-person videos, making them unsuitable for robotizing humans. To bridge this gap, we introduce X-Humanoid, a generative video editing approach that adapts the powerful Wan 2.2 model into a video-to-video structure and finetunes it for the human-to-humanoid translation task. This finetuning requires paired human-humanoid videos, so we designed a scalable data creation pipeline,…

Taxonomy

Topics: Social Robot Interaction and HRI · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis