RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Zhe Li; Cheng Chi; Boan Zhu; Yangyang Wei; Shuanghao Bai; Yuheng Ji; Yibo Peng; Tao Huang; Pengwei Wang; Zhongyuan Wang; S.-H. Gary Chan; Chang Xu; Shanghang Zhang

arXiv:2512.23649 · cs.RO · January 6, 2026

Open Access

TL;DR

RoboMirror introduces a novel video-to-locomotion framework that interprets visual content to generate realistic humanoid movements, bridging the gap between visual understanding and control for improved telepresence and efficiency.

Contribution

RoboMirror is the first retargeting-free, diffusion-based approach that translates videos directly into humanoid locomotion, without pose reconstruction or staged pipelines.

Findings

01. Reduces third-person control latency by 80%
02. Achieves 3.7% higher task success rate than baselines
03. Enables telepresence via egocentric videos

Abstract

Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments…
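
To make the phrase "motion intents, which directly condition a diffusion-based policy" more concrete, here is a minimal sketch of conditional diffusion sampling in plain Python. It is not the authors' implementation: the abstract does not specify the network, the noise schedule, or the action space, so every name here (`make_schedule`, `toy_noise_predictor`, `sample_action`), the linear beta schedule, and the 4-dimensional action vector are assumptions made for illustration. The `intent` list stands in for whatever representation the VLM produces, and the noise predictor is an untrained toy function rather than a learned model.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and its cumulative products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x, t_frac, intent):
    """Hypothetical stand-in for a trained network eps_theta(x_t, t, c).

    A real policy would be a learned model; this toy only shows that the
    predicted noise depends on the conditioning vector `intent`.
    """
    bias = sum(intent) / len(intent)
    return [math.tanh(0.5 * xi - bias + 0.1 * t_frac) for xi in x]


def sample_action(intent, action_dim=4, num_steps=50, seed=0):
    """DDPM-style ancestral sampling of one action vector given an intent."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    # Start from pure Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t / num_steps, intent)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps_hat)]
        # No noise is added on the final step.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


if __name__ == "__main__":
    walk_intent = [0.0, 0.0]   # placeholder intent embeddings
    jump_intent = [1.0, 1.0]
    print(sample_action(walk_intent))
    print(sample_action(jump_intent))
```

With the same seed, changing only `intent` changes the sampled action, which is the sense in which the conditioning steers generation in this sketch. How RoboMirror actually injects the intent into its policy is not stated in the abstract.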

Taxonomy

Topics: Social Robot Interaction and HRI · Human Pose and Action Recognition · Human Motion and Animation