Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis

Weisheng Xu; Jian Li; Yi Gu; Bin Yang; Haodong Chen; Shuyi Lin; Mingqian Zhou; Jing Tan; Qiwei Wu; Xiangrui Jiang; Taowen Wang; Jiawen Wen; Qiwei Liang; Jiaxi Zhang; Renjing Xu

arXiv:2603.19709 · cs.RO · March 25, 2026

TL;DR

This paper introduces Dream2Act, a robot-centric video synthesis framework that enables zero-shot humanoid interaction by generating morphology-consistent motions from third-person images, avoiding retargeting errors and extensive training.

Contribution

Dream2Act is a novel zero-shot framework that synthesizes robot-native, physically feasible motions directly from video, eliminating the need for task-specific policy training or retargeting.

Findings

01. Achieves a 37.5% success rate across four tasks.

02. Outperforms the retargeting baseline, which achieves a 0% success rate.

03. Maintains spatial alignment and physical contact formation.

Abstract

Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these…

Taxonomy

Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Human Pose and Action Recognition