Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval

Jiwen Yu; Jianhong Bai; Yiran Qin; Quande Liu; Xintao Wang; Pengfei Wan; Di Zhang; Xihui Liu

arXiv:2506.03141 · cs.CV · August 13, 2025

TL;DR

This paper introduces Context-as-Memory, an approach to long video generation that treats historical context frames as memory and retrieves the relevant ones by FOV (Field of View) overlap between camera poses, improving scene consistency and generalization.

Contribution

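The paper combines a simple memory storage and conditioning scheme (context is stored as frames and concatenated with the frames to be predicted along the frame dimension) with a memory retrieval module based on FOV overlap, improving scene consistency in long video generation.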
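To make the retrieval idea concrete, here is a minimal sketch of selecting context frames by FOV overlap. It is not the authors' implementation: it assumes cameras moving in a 2D ground plane, a fixed horizontal FOV and viewing range, and a sampling-based overlap score. The names `CameraPose`, `fov_overlap`, and `retrieve_context` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Hypothetical 2D pose: position in the ground plane plus viewing direction."""
    x: float
    y: float
    yaw: float  # radians


def in_fov(cam: CameraPose, px: float, py: float, fov: float, max_dist: float) -> bool:
    """Return True if point (px, py) lies inside cam's view sector."""
    dx, dy = px - cam.x, py - cam.y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return True
    if dist > max_dist:
        return False
    # Angular difference to the viewing direction, wrapped to [-pi, pi].
    diff = math.atan2(dy, dx) - cam.yaw
    diff = math.atan2(math.sin(diff), math.cos(diff))
    return abs(diff) <= fov / 2


def fov_overlap(a: CameraPose, b: CameraPose, fov: float = math.radians(90),
                max_dist: float = 10.0, n_ang: int = 16, n_rad: int = 8) -> float:
    """Fraction of sample points in a's view sector that b also sees (assumed score)."""
    hits = 0
    for i in range(n_ang):
        ang = a.yaw - fov / 2 + fov * (i + 0.5) / n_ang
        for j in range(n_rad):
            r = max_dist * (j + 0.5) / n_rad
            px, py = a.x + r * math.cos(ang), a.y + r * math.sin(ang)
            hits += in_fov(b, px, py, fov, max_dist)
    return hits / (n_ang * n_rad)


def retrieve_context(target: CameraPose, history: list, k: int = 2) -> list:
    """Indices of up to k history poses with the largest non-zero overlap."""
    scored = [(fov_overlap(target, pose), idx) for idx, pose in enumerate(history)]
    scored = [s for s in scored if s[0] > 0.0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


target = CameraPose(0.0, 0.0, 0.0)
history = [
    CameraPose(0.0, 0.0, 0.0),      # same view
    CameraPose(0.0, 0.0, math.pi),  # facing the opposite way
    CameraPose(0.0, 0.0, 0.3),      # slightly rotated
]
selected = retrieve_context(target, history, k=2)
```

In this toy setup the pose facing the opposite direction scores zero and is filtered out, so only frames that plausibly show the same part of the scene are passed on as context.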

Findings

01. Outperforms state-of-the-art methods in memory capabilities
02. Generalizes well to open-domain scenarios
03. Reduces computational overhead with effective memory retrieval

Abstract

Recent advances in interactive video generation have shown promising results, yet existing approaches struggle with scene-consistent memory capabilities in long video generation due to limited use of historical context. In this work, we propose Context-as-Memory, which utilizes historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating context and frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, considering the enormous computational overhead of incorporating all historical context, we propose the Memory Retrieval module to select truly relevant context frames by determining FOV (Field of View) overlap between camera poses, which significantly reduces the number of candidate…
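Design (2) in the abstract, conditioning by concatenation along the frame dimension, can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's code: each frame is stood in for by a small list of floats, and `build_input` and `extract_targets` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # stand-in for one frame (or frame latent)


def build_input(context: Sequence[Frame],
                noisy_targets: Sequence[Frame]) -> Tuple[List[Frame], List[bool]]:
    """Concatenate context frames and frames to be predicted along the frame axis.

    Returns the joint sequence plus a mask marking which positions are targets,
    so no separate control module is needed to inject the context.
    """
    sequence = list(context) + list(noisy_targets)
    is_target = [False] * len(context) + [True] * len(noisy_targets)
    return sequence, is_target


def extract_targets(sequence: Sequence[Frame], is_target: Sequence[bool]) -> List[Frame]:
    """Keep only the predicted frames after the model has processed the sequence."""
    return [frame for frame, flag in zip(sequence, is_target) if flag]


context_frames = [[0.1, 0.2], [0.3, 0.4]]  # retrieved memory frames
noisy_frames = [[9.0, 9.0]]                # frame to be generated
sequence, mask = build_input(context_frames, noisy_frames)
predicted = extract_targets(sequence, mask)
```

The mask is one simple way to tell the model (and the loss) which positions are clean context and which are to be generated; the source does not say how the authors make this distinction.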

Code & Models

Datasets

KlingTeam/Context-as-Memory-Dataset
dataset · 236 downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging