From Generated Human Videos to Physically Plausible Robot Trajectories
James Ni, Zekai Wang, Wei Lin, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik, Roei Herzig

TL;DR
This paper presents a pipeline that converts generated human videos into physically plausible robot trajectories, so that a humanoid can imitate the depicted human actions zero-shot. It pairs a reinforcement learning tracking policy with a new benchmark for this setting.
Contribution
It introduces a two-stage pipeline: generated video is first lifted into a 4D human representation and retargeted to the humanoid morphology, and the resulting motion is then tracked by a physics-aware RL policy. It also introduces a new benchmark for zero-shot generalization.
Findings
Improved simulation performance over baselines
Physically stable motion tracking on a humanoid robot
Effective zero-shot imitation from noisy generated videos
Abstract
Video generation models are rapidly improving in their ability to synthesize human actions in novel contexts, holding the potential to serve as high-level planners for contextual robot control. To realize this potential, a key research question remains open: how can a humanoid execute the human actions from generated videos in a zero-shot manner? This challenge arises because generated videos are often noisy and exhibit morphological distortions that make direct imitation difficult compared to real video. To address this, we introduce a two-stage pipeline. First, we lift video pixels into a 4D human representation and then retarget to the humanoid morphology. Second, we propose GenMimic, a physics-aware reinforcement learning policy conditioned on 3D keypoints and trained with symmetry regularization and keypoint-weighted tracking rewards. As a result, GenMimic can mimic human actions…
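The abstract names keypoint-weighted tracking rewards but this excerpt does not give their form. As a rough illustration only, the sketch below assumes an exponential kernel over a weighted mean of squared 3D keypoint errors. The function name, the `sigma` scale, and the example weights are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence

Point = Sequence[float]  # one 3D keypoint as (x, y, z)


def keypoint_weighted_tracking_reward(
    robot_keypoints: Sequence[Point],
    target_keypoints: Sequence[Point],
    weights: Sequence[float],
    sigma: float = 0.25,  # assumed error scale in metres, not from the paper
) -> float:
    """Hypothetical reward: exp(-weighted mean squared keypoint error / sigma^2)."""
    if not (len(robot_keypoints) == len(target_keypoints) == len(weights)):
        raise ValueError("keypoints and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_error = 0.0
    for robot, target, weight in zip(robot_keypoints, target_keypoints, weights):
        # Squared Euclidean distance between the robot and target keypoint.
        squared_distance = sum((r - t) ** 2 for r, t in zip(robot, target))
        weighted_error += weight * squared_distance

    # Reward is 1.0 for perfect tracking and decays toward 0 as error grows.
    return math.exp(-(weighted_error / total_weight) / sigma**2)


# Two keypoints: the first is tracked exactly, the second is off by 10 cm.
target = [(0.0, 0.0, 1.0), (0.3, 0.0, 0.8)]
robot = [(0.0, 0.0, 1.0), (0.3, 0.1, 0.8)]

reward_uniform = keypoint_weighted_tracking_reward(robot, target, [1.0, 1.0])
reward_second_heavy = keypoint_weighted_tracking_reward(robot, target, [1.0, 3.0])
```

In this sketch, raising the weight on the mistracked keypoint lowers the reward, which is the general effect a per-keypoint weighting is meant to have: errors on the keypoints deemed more important cost more.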
Taxonomy
Topics: Robot Manipulation and Learning · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
