Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
Zhiyuan Li, Wenyan Yang, Wenshuai Zhao, Yue Ma, Yuanpeng Tu, Pekka Marttinen, Joni Pajarinen

TL;DR
This paper introduces a generative framework that learns disentangled representations for cross-embodiment video editing, enabling the synthesis of robot demonstrations from human videos without paired data.
Contribution
The paper proposes a dual contrastive objective that learns independent task and embodiment representations, improving cross-embodiment video synthesis for robotics.
Findings
Produces temporally consistent robot videos from human demonstrations.
Learns disentangled representations without paired cross-embodiment data.
Enables scalable robot learning from internet-scale human videos.
Abstract
Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling…
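The dual objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' loss, which the abstract does not specify. It assumes an InfoNCE term for intra-space consistency and a squared cross-correlation penalty as a crude, linear-only stand-in for mutual information between the two codes. All function names, the temperature, and the weighting are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`
    and be dissimilar from every other row (assumed consistency term)."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                       # (N, N) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def cross_correlation_penalty(z_task, z_emb, eps=1e-8):
    """Mean squared cross-correlation between task and embodiment codes.
    Assumption: used here as a simple proxy for mutual information; it only
    detects linear dependence, so zero penalty does not imply independence."""
    zt = (z_task - z_task.mean(axis=0)) / (z_task.std(axis=0) + eps)
    ze = (z_emb - z_emb.mean(axis=0)) / (z_emb.std(axis=0) + eps)
    corr = zt.T @ ze / z_task.shape[0]                   # (d_task, d_emb)
    return float(np.mean(corr ** 2))

def dual_objective(z_task, z_task_pos, z_emb, z_emb_pos, weight=1.0):
    """Consistency within each space plus a weighted independence penalty across spaces."""
    consistency = info_nce(z_task, z_task_pos) + info_nce(z_emb, z_emb_pos)
    independence = cross_correlation_penalty(z_task, z_emb)
    return consistency + weight * independence
```

In this sketch, `z_task_pos` would hold task codes from clips showing the same task under a different embodiment, and `z_emb_pos` would hold embodiment codes from clips of the same embodiment performing a different task. That pairing scheme is also an assumption made for illustration.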
