VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Yang Bai, Liudi Yang, Ziyuan Liu

TL;DR
VideoWeaver introduces a multimodal multi-view video-to-video translation framework that ensures view consistency and scalability across multiple synchronized cameras, advancing embodied AI demonstration resimulation.
Contribution
VideoWeaver is presented as the first framework to extend V2V translation to multi-view settings, using a shared 4D latent space and autoregressive synthesis across multiple viewpoints.
Findings
Achieves state-of-the-art results on single-view benchmarks.
Demonstrates physically and stylistically consistent multi-view translations.
Enables scalable multi-camera translation with dynamic viewpoints.
Abstract
Recent progress in video-to-video (V2V) translation has enabled realistic resimulation of embodied AI demonstrations, a capability that allows pretrained robot policies to transfer to new environments without additional data collection. However, prior work operates on only a single view at a time, whereas embodied AI tasks are commonly captured from multiple synchronized cameras to support policy learning. Naively applying single-view models independently to each camera leads to inconsistent appearance across views, and standard transformer architectures do not scale to multi-view settings because of the quadratic cost of cross-view attention. We present VideoWeaver, the first multimodal multi-view V2V translation framework. VideoWeaver is initially trained as a single-view flow-based V2V model. To extend it to the multi-view regime, we propose to ground all views in a…
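The scaling argument in the abstract can be made concrete with a rough count of query-key pairs. The sketch below assumes plain full self-attention, where every token attends to every other token; the function names and the per-view token count are illustrative assumptions, not values or components from the paper.

```python
def attention_pairs_joint(num_views: int, tokens_per_view: int) -> int:
    """Query-key pairs when all views are concatenated into one sequence
    and processed with full self-attention (assumed baseline)."""
    total_tokens = num_views * tokens_per_view
    return total_tokens * total_tokens


def attention_pairs_independent(num_views: int, tokens_per_view: int) -> int:
    """Query-key pairs when each view is processed on its own,
    with no cross-view attention at all."""
    return num_views * tokens_per_view * tokens_per_view


if __name__ == "__main__":
    tokens = 1000  # hypothetical token count per view, chosen for illustration
    for views in (1, 2, 4, 8):
        joint = attention_pairs_joint(views, tokens)
        independent = attention_pairs_independent(views, tokens)
        # The ratio joint / independent equals the number of views.
        print(views, joint, independent, joint // independent)
```

Under these assumptions, joint attention costs a factor of `num_views` more than independent per-view processing, which avoids that cost but gives up the cross-view interaction needed for consistent appearance.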
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Face recognition and analysis
