WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Quanjian Song; Yiren Song; Kelly Peng; Yuan Gao; Mike Zheng Shou

arXiv:2511.22098 · cs.CV · December 1, 2025

Open Access

TL;DR

WorldWander is an in-context learning framework for translating videos between egocentric (first-person) and exocentric (third-person) perspectives. It builds on video diffusion transformers and a new dataset, EgoExo-8K, with the aim of improving cross-view synchronization and character consistency.

Contribution

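The paper presents WorldWander, an in-context learning approach to cross-view video translation. It combines two modules, In-Context Perspective Alignment and Collaborative Position Encoding, with EgoExo-8K, a large-scale dataset of synchronized egocentric-exocentric triplets drawn from synthetic and real-world scenarios.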

Findings

01. Achieves superior perspective synchronization

02. Maintains character consistency across views

03. Sets a new benchmark in egocentric-exocentric video translation

Abstract

Video diffusion models have recently achieved remarkable progress in realism and controllability. However, achieving seamless video translation across different perspectives, such as first-person (egocentric) and third-person (exocentric), remains underexplored. Bridging these perspectives is crucial for filmmaking, embodied AI, and world models. Motivated by this, we present WorldWander, an in-context learning framework tailored for translating between egocentric and exocentric worlds in video generation. Building upon advanced video diffusion transformers, WorldWander integrates (i) In-Context Perspective Alignment and (ii) Collaborative Position Encoding to efficiently model cross-view synchronization. To further support our task, we curate EgoExo-8K, a large-scale dataset containing synchronized egocentric-exocentric triplets from both synthetic and real-world scenarios. Experiments…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Face recognition and analysis