Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing

Zhiyuan Li; Wenyan Yang; Wenshuai Zhao; Yue Ma; Yuanpeng Tu; Pekka Marttinen; Joni Pajarinen

arXiv:2605.03637·cs.RO·May 6, 2026

TL;DR

This paper introduces a generative framework that learns disentangled representations for cross-embodiment video editing, enabling the synthesis of robot demonstrations from human videos without paired data.

Contribution

It proposes a dual contrastive objective to learn independent task and embodiment representations, improving cross-embodiment video synthesis for robotics.

Findings

01. Produces temporally consistent robot videos from human demonstrations.

02. Learns disentangled representations without paired cross-embodiment data.

03. Enables scalable robot learning from internet-scale human videos.

Abstract

Learning robotic manipulation from human videos is a promising solution to the data bottleneck in robotics, but the distribution shift between humans and robots remains a critical challenge. Existing approaches often produce entangled representations, where task-relevant information is coupled with human-specific kinematics, limiting their adaptability. We propose a generative framework for cross-embodiment video editing that directly addresses this by learning explicitly disentangled task and embodiment representations. Our method factorizes a demonstration video into two orthogonal latent spaces by enforcing a dual contrastive objective: it minimizes mutual information between the spaces to ensure independence while maximizing intra-space consistency to create stable representations. A parameter-efficient adapter injects these latent codes into a frozen video diffusion model, enabling…
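
The abstract describes the dual objective only at a high level, so the following is a minimal NumPy sketch of what such a loss could look like, not the authors' implementation. It assumes an InfoNCE term for intra-space consistency and swaps in a simple cross-covariance penalty as a decorrelation proxy for mutual-information minimization (it is not a true mutual-information estimator). All function names, the temperature, the weighting parameter, and the way positive pairs are formed are hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row of x to (approximately) unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: row i of `anchors` should match row i of `positives`.

    Assumed stand-in for the "intra-space consistency" term.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                      # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def cross_covariance_penalty(z_task, z_emb):
    """Squared Frobenius norm of the task/embodiment cross-covariance.

    Assumed proxy for the independence term: it only penalizes linear
    correlation between the two latent spaces.
    """
    zt = z_task - z_task.mean(axis=0, keepdims=True)
    ze = z_emb - z_emb.mean(axis=0, keepdims=True)
    cov = zt.T @ ze / (z_task.shape[0] - 1)              # (D_task, D_emb)
    return float(np.sum(cov ** 2))


def dual_objective(z_task, z_task_pos, z_emb, z_emb_pos, independence_weight=1.0):
    """Consistency within each space plus a weighted independence penalty."""
    consistency = info_nce(z_task, z_task_pos) + info_nce(z_emb, z_emb_pos)
    independence = cross_covariance_penalty(z_task, z_emb)
    return consistency + independence_weight * independence
```

In this sketch, `z_task_pos` would hold task codes from a second clip of the same task and `z_emb_pos` embodiment codes from a second clip of the same embodiment. The hypothetical `independence_weight` trades off how strongly the two spaces are pushed apart against how tightly each space clusters its own positives.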
