Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, Lei Zhang

TL;DR
This paper introduces a decoupled memory control framework for long-horizon video generation that improves spatial consistency and efficiency by separating memory from the generative process.
Contribution
It proposes a lightweight, independent memory module with cross-attention and camera-aware gating to enhance spatial consistency and reduce training costs.
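The summary names cross-attention and camera-aware gating but does not give their formulation, so the following is only a minimal illustrative sketch and not the authors' implementation. It assumes that generator tokens act as queries over stored memory tokens, and that a scalar gate computed from the distance between the current camera pose and a remembered pose decides how much attended memory is added back. The function names, the exponential gate, and the `sharpness` parameter are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, memory):
    # Each query token attends over the memory tokens.
    # For brevity the memory tokens serve as both keys and values (assumption).
    dim = len(memory[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, m) * scale for m in memory])
        out.append([sum(w * m[d] for w, m in zip(weights, memory))
                    for d in range(dim)])
    return out

def camera_gate(cur_pose, mem_pose, sharpness=1.0):
    # Hypothetical gate in (0, 1]: equals 1 when the current camera pose
    # matches the remembered pose and decays with their Euclidean distance.
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(cur_pose, mem_pose)))
    return math.exp(-sharpness * dist)

def gated_memory_update(tokens, memory, cur_pose, mem_pose):
    # Residual update: tokens + gate * cross_attention(tokens, memory).
    attended = cross_attend(tokens, memory)
    g = camera_gate(cur_pose, mem_pose)
    return [[t + g * a for t, a in zip(tok, att)]
            for tok, att in zip(tokens, attended)]
```

Under these assumptions, revisiting a remembered pose lets the memory contribute fully, while a camera far from any remembered view leaves the tokens essentially unchanged, which is one way a memory branch could avoid constraining generation in novel regions.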
Findings
Achieves state-of-the-art spatial consistency in generated videos.
Reduces training costs compared to entangled memory models.
Preserves the generative capacity to explore novel scene regions.
Abstract
Spatially consistent long-horizon video generation aims to maintain temporal and spatial consistency along predefined camera trajectories. Existing methods mostly entangle memory modeling with video generation, leading to inconsistent content during scene revisits and diminished generative capacity when exploring novel regions, even when trained on extensive annotated data. To address these limitations, we propose a decoupled framework that separates memory conditioning from generation. Our approach significantly reduces training costs while simultaneously enhancing spatial consistency and preserving the generative capacity for novel scene exploration. Specifically, we employ a lightweight, independent memory branch to learn precise spatial consistency from historical observations. We first introduce a hybrid memory representation to capture complementary temporal and spatial cues from…
