VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents

George Eskandar; Fengyi Shen; Mohammad Altillawi; Dong Chen; Yang Bai; Liudi Yang; Ziyuan Liu

arXiv:2603.25420·cs.CV·March 27, 2026


Open Access

TL;DR

VideoWeaver introduces a multimodal multi-view video-to-video translation framework that ensures view consistency and scalability across multiple synchronized cameras, advancing embodied AI demonstration resimulation.

Contribution

VideoWeaver is the first framework to extend V2V translation to multi-view settings, using a shared 4D latent space and autoregressive synthesis for multiple viewpoints.

Findings

01. Achieves state-of-the-art results on single-view benchmarks.

02. Demonstrates physically and stylistically consistent multi-view translations.

03. Enables scalable multi-camera translation with dynamic viewpoints.

Abstract

Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to be transferable to new environments without additional data collection. However, prior works can only operate on a single view at a time, while embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings due to the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To achieve an extension to the multi-view regime, we propose to ground all views in a…
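
The scaling argument in the abstract (quadratic cost of cross-view attention) can be made concrete with a small counting sketch. It assumes plain full self-attention, in which every token is scored against every other token. The function name `attention_pairs` and the token count per view are illustrative assumptions, not values or code from the paper.

```python
def attention_pairs(num_views: int, tokens_per_view: int, joint: bool) -> int:
    """Count query-key pairs scored by full self-attention.

    joint=True  : all views are concatenated into one sequence (cross-view attention).
    joint=False : each view is processed independently (no cross-view attention).
    """
    if joint:
        total_tokens = num_views * tokens_per_view
        return total_tokens * total_tokens
    return num_views * tokens_per_view * tokens_per_view


# Hypothetical number of latent tokens per camera view.
TOKENS_PER_VIEW = 1000

for views in (1, 2, 4, 8):
    joint = attention_pairs(views, TOKENS_PER_VIEW, joint=True)
    independent = attention_pairs(views, TOKENS_PER_VIEW, joint=False)
    # The joint cost exceeds the independent cost by a factor equal to the view count.
    print(f"views={views} joint={joint} independent={independent} ratio={joint // independent}")
```

Under these assumptions, concatenating V views multiplies the pair count by V squared relative to a single view, whereas independent per-view processing grows only linearly in V but gives up cross-view consistency.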


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Face recognition and analysis