MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

Wei Yu; Runjia Qian; Yumeng Li; Liquan Wang; Songheng Yin; Sri Siddarth Chakaravarthy P; Dennis Anthony; Yang Ye; Yidi Li; Weiwei Wan; Animesh Garg

arXiv:2603.17117 · cs.CV · March 19, 2026

Open Access

TL;DR

MosaicMem introduces a hybrid spatial memory system for video world models that combines 3D patch localization with model conditioning, enhancing consistency, dynamic scene modeling, and controllability in video generation.

Contribution

The paper proposes Mosaic Memory (MosaicMem), a hybrid spatial memory that combines explicit 3D patch lifting with implicit model conditioning to improve localization and scene consistency in video world models.

Findings

01. Improved pose adherence over implicit memory methods.

02. Enhanced dynamic scene modeling compared to explicit baselines.

03. Enabled minute-level navigation and scene editing in video models.

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments…
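
The abstract's idea of lifting patches into 3D and placing them in a queried view can be illustrated with standard pinhole-camera geometry. The sketch below is not the authors' implementation: it assumes a known depth for each patch center, shared intrinsics (`fx`, `fy`, `cx`, `cy`), and camera-to-world poses given as a rotation matrix and translation, none of which the abstract specifies. The function names `lift_patch_center` and `project_to_view` are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def mat_vec(m: Mat3, v: Sequence[float]) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def transpose(m: Mat3) -> list:
    """Transpose a 3x3 matrix (the inverse of a rotation matrix)."""
    return [[m[j][i] for j in range(3)] for i in range(3)]


def lift_patch_center(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float,
                      R_c2w: Mat3, t_c2w: Sequence[float]) -> Vec3:
    """Unproject a patch center (u, v) with an assumed depth into world space."""
    # Back-project through the pinhole model into camera coordinates.
    p_cam = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Move from camera coordinates to world coordinates.
    p_rot = mat_vec(R_c2w, p_cam)
    return tuple(p_rot[i] + t_c2w[i] for i in range(3))


def project_to_view(p_world: Sequence[float],
                    fx: float, fy: float, cx: float, cy: float,
                    R_c2w: Mat3, t_c2w: Sequence[float]
                    ) -> Optional[Tuple[float, float]]:
    """Project a world point into a queried camera; None if behind the camera."""
    # Invert the camera-to-world pose: p_cam = R^T (p_world - t).
    d = tuple(p_world[i] - t_c2w[i] for i in range(3))
    p_cam = mat_vec(transpose(R_c2w), d)
    if p_cam[2] <= 0:
        return None  # The stored patch is not visible from this view.
    return (fx * p_cam[0] / p_cam[2] + cx, fy * p_cam[1] / p_cam[2] + cy)


# Hypothetical example: identity rotations, illustrative intrinsics.
IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Lift the center pixel of the source view at depth 2 into world space.
point = lift_patch_center(320.0, 240.0, 2.0, R_c2w=IDENTITY,
                          t_c2w=(0.0, 0.0, 0.0), **K)

# Query a camera shifted 0.5 units to the right: the patch moves left.
moved = project_to_view(point, R_c2w=IDENTITY, t_c2w=(0.5, 0.0, 0.0), **K)
print(moved)  # → (195.0, 240.0)
```

In a memory of this kind, a `None` result would mark a stored patch as irrelevant to the queried view, while a valid pixel location indicates where the patch could be composed before the model fills in the remaining content. How MosaicMem actually obtains depth, selects patches, and aligns them is described in the paper, not here.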

Taxonomy

Topics: Advanced Vision and Imaging · Human Motion and Animation · Robotics and Sensor-Based Localization