Captain Safari: A World Engine with Pose-Aligned 3D Memory
Yu-Cheng Chou, Xingrui Wang, Yitong Li, Jiahao Wang, Hanting Liu, Cihang Xie, Alan Yuille, Junfei Xiao

TL;DR
Captain Safari introduces a pose-conditioned world engine that uses a persistent memory and retrieval mechanism to generate long, 3D-consistent videos with accurate camera control, outperforming existing methods on OpenSafari, a new in-the-wild FPV drone video dataset.
Contribution
The paper presents a novel pose-conditioned world engine with a dynamic memory retrieval system, enabling stable, long-horizon, 3D-consistent video synthesis under complex camera trajectories.
Findings
Outperforms state-of-the-art camera-controlled video generators in quality and consistency.
Significantly reduces MEt3R (a multi-view 3D-consistency error) and FVD (Fréchet Video Distance).
Achieves higher preference scores in human evaluations.
Abstract
World engines aim to synthesize long, 3D-consistent videos that support interactive exploration of a scene under user-controlled camera motion. However, existing systems struggle under aggressive 6-DoF trajectories and complex outdoor layouts: they lose long-range geometric coherence, deviate from the target path, or collapse into overly conservative motion. To this end, we introduce Captain Safari, a pose-conditioned world engine that generates videos by retrieving from a persistent world memory. Given a camera path, our method maintains a dynamic local memory and uses a retriever to fetch pose-aligned world tokens, which then condition video generation along the trajectory. This design enables the model to maintain stable 3D structure while accurately executing challenging camera maneuvers. To evaluate this setting, we curate OpenSafari, a new in-the-wild FPV dataset containing…
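The memory-and-retrieval idea in the abstract can be illustrated with a small sketch. This is not the paper's method: the abstract does not specify how the retriever works, so the example below swaps in a plain nearest-neighbour lookup by pose distance. The class LocalPoseMemory, the pose_distance metric, the rot_weight parameter, and the token shapes are all hypothetical names and choices made for illustration only.

```python
import numpy as np

def rotation_angle(R_a, R_b):
    """Geodesic angle (radians) between two 3x3 rotation matrices."""
    cos = (np.trace(R_a.T @ R_b) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_distance(pose_a, pose_b, rot_weight=1.0):
    """Assumed metric: translation distance plus weighted rotation angle."""
    (R_a, t_a), (R_b, t_b) = pose_a, pose_b
    return float(np.linalg.norm(t_a - t_b)) + rot_weight * rotation_angle(R_a, R_b)

class LocalPoseMemory:
    """Toy local memory (hypothetical): keeps the latest `capacity` entries."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = []  # list of (pose, tokens) pairs, oldest first

    def write(self, pose, tokens):
        # Append the new entry and drop the oldest ones beyond capacity.
        self.entries.append((pose, np.asarray(tokens)))
        self.entries = self.entries[-self.capacity:]

    def retrieve(self, query_pose, k=2):
        # Stand-in for a learned retriever: return the tokens of the
        # k stored poses closest to the query pose, nearest first.
        ranked = sorted(self.entries, key=lambda e: pose_distance(e[0], query_pose))
        return [tokens for _, tokens in ranked[:k]]

def yaw(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Write four poses along a gently turning path into a memory of capacity 3.
memory = LocalPoseMemory(capacity=3)
for i in range(4):
    pose = (yaw(0.1 * i), np.array([float(i), 0.0, 0.0]))
    memory.write(pose, np.full((2, 4), float(i)))  # dummy "world tokens"

# Query a target camera pose; the retrieved tokens would condition generation.
query = (yaw(0.3), np.array([2.9, 0.0, 0.0]))
retrieved = memory.retrieve(query, k=2)
```

In a full system the retrieved tokens would be passed to the video generator as conditioning at each step of the camera path; here they are dummy arrays, so the sketch only shows the lookup.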
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Multimodal Machine Learning Applications
