Morphology-Consistent Humanoid Interaction through Robot-Centric Video Synthesis
Weisheng Xu, Jian Li, Yi Gu, Bin Yang, Haodong Chen, Shuyi Lin, Mingqian Zhou, Jing Tan, Qiwei Wu, Xiangrui Jiang, Taowen Wang, Jiawen Wen, Qiwei Liang, Jiaxi Zhang, Renjing Xu

TL;DR
This paper introduces Dream2Act, a robot-centric video synthesis framework that enables zero-shot humanoid interaction by generating morphology-consistent motions from third-person images, avoiding retargeting errors and extensive training.
Contribution
Dream2Act is a novel zero-shot framework that synthesizes robot-native, physically feasible motions directly from video, eliminating the need for task-specific policy training or retargeting.
Findings
Achieves a 37.5% success rate across four tasks.
Outperforms the retargeting baseline, which has a 0% success rate.
Maintains spatial alignment and physical contact formation.
Abstract
Equipping humanoid robots with versatile interaction skills typically requires either extensive policy training or explicit human-to-robot motion retargeting. However, learning-based policies face prohibitive data collection costs. Meanwhile, retargeting relies on human-centric pose estimation (e.g., SMPL), introducing a morphology gap. Skeletal scale mismatches result in severe spatial misalignments when mapped to robots, compromising interaction success. In this work, we propose Dream2Act, a robot-centric framework enabling zero-shot interaction through generative video synthesis. Given a third-person image of the robot and target object, our framework leverages video generation models to envision the robot completing the task with morphology-consistent motion. We employ a high-fidelity pose extraction system to recover physically feasible, robot-native joint trajectories from these…
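To make the morphology gap described in the abstract concrete, the following toy sketch copies joint angles from a human-sized planar arm onto a shorter robot arm and measures how far the hand ends up from where the human hand was. It is not the paper's pipeline or any real retargeting system: the two-link planar model, the link lengths, and the function names `forward_kinematics` and `retargeting_gap` are all assumptions chosen for illustration.

```python
import math

def forward_kinematics(link_lengths, joint_angles):
    """End-effector (x, y) of a planar serial arm.

    joint_angles are relative to the previous link, in radians.
    """
    x = y = 0.0
    cumulative = 0.0
    for length, angle in zip(link_lengths, joint_angles):
        cumulative += angle
        x += length * math.cos(cumulative)
        y += length * math.sin(cumulative)
    return x, y

# Hypothetical upper-arm and forearm lengths in metres (not from the paper).
HUMAN_LINKS = (0.30, 0.27)
ROBOT_LINKS = (0.20, 0.18)

def retargeting_gap(joint_angles):
    """Distance between the human hand and the robot hand when the
    human joint angles are copied onto the robot unchanged."""
    hx, hy = forward_kinematics(HUMAN_LINKS, joint_angles)
    rx, ry = forward_kinematics(ROBOT_LINKS, joint_angles)
    return math.hypot(hx - rx, hy - ry)

# A reaching pose: shoulder at 30 degrees, elbow at 40 degrees.
angles = (math.radians(30), math.radians(40))
gap = retargeting_gap(angles)
print(f"hand offset after naive angle copy: {gap:.3f} m")
```

Under these assumed lengths the robot hand never reaches the point the human hand touched, so an object placed at the human contact point would be missed. This is the kind of spatial misalignment the abstract attributes to skeletal scale mismatch.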
Taxonomy
Topics: Robot Manipulation and Learning · Social Robot Interaction and HRI · Human Pose and Action Recognition
