MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives
Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, Hengshuang Zhao

TL;DR
MemFlow introduces a dynamic memory system for long video generation that retrieves relevant historical frames using each chunk's text prompt, keeping the narrative coherent at a small computational cost.
Contribution
The paper presents MemFlow, a novel memory mechanism that dynamically updates and selectively activates relevant frames for consistent long video narration.
Findings
Achieves long-term narrative coherence in video generation.
Maintains high efficiency, with only a 7.9% speed reduction relative to the memory-free baseline.
Compatible with existing streaming video models.
Abstract
The core challenge for streaming video generation is maintaining content consistency over long contexts, which places high demands on memory design. Most existing solutions maintain the memory by compressing historical frames with predefined strategies. However, different video chunks to be generated should refer to different historical cues, which fixed strategies struggle to satisfy. In this work, we propose MemFlow to address this problem. Specifically, before generating the upcoming chunk, we dynamically update the memory bank by retrieving the historical frames most relevant to the text prompt of that chunk. This design enables narrative coherence even when a new event happens or the scenario switches in future frames. In addition, during generation, we activate only the most relevant tokens in the memory bank for each query in the attention layers, which effectively guarantees…
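The two mechanisms the abstract describes, prompt-driven memory retrieval and selective token activation, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes frames and prompts are already embedded as plain vectors, that relevance is measured by cosine similarity, and that activation is a top-k cut on scaled dot-product scores. The names `update_memory`, `sparse_attention`, `bank_size`, and `top_k` are hypothetical and do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def update_memory(prompt_emb, history, bank_size):
    """Assumed retrieval rule: keep the `bank_size` historical frames whose
    embeddings are most similar to the upcoming chunk's prompt embedding.

    history: list of (frame_id, embedding) pairs.
    """
    ranked = sorted(history, key=lambda item: cosine(prompt_emb, item[1]), reverse=True)
    return ranked[:bank_size]


def sparse_attention(query, keys, values, top_k):
    """Assumed activation rule: softmax attention restricted to the `top_k`
    memory tokens with the highest scaled dot-product score for this query.

    Returns the attended output vector and the sorted indices of active tokens.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    active = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]

    # Numerically stable softmax over the active tokens only.
    peak = max(scores[i] for i in active)
    weights = [math.exp(scores[i] - peak) for i in active]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, i in zip(weights, active):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, sorted(active)


# Toy usage: four historical frames, a prompt aligned with frames 0 and 2.
history = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [0.9, 0.1]), (3, [-1.0, 0.0])]
bank = update_memory([1.0, 0.0], history, bank_size=2)
print([frame_id for frame_id, _ in bank])  # [0, 2]
```

Restricting attention to `top_k` tokens keeps the per-query cost tied to `top_k` rather than to the full memory size, which is one plausible reading of why selective activation keeps overhead low. The excerpt here does not say how the paper scores relevance or selects tokens.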
Taxonomy
Topics: Video Analysis and Summarization · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
