StoryMem: Multi-shot Long Video Storytelling with Memory
Kaiwen Zhang, Liming Jiang, Angtian Wang, Jacob Zhiyuan Fang, Tiancheng Zhi, Qing Yan, Hao Kang, Xin Lu, Xingang Pan

TL;DR
StoryMem introduces a memory-augmented diffusion framework for long-form video storytelling. It produces coherent, high-quality multi-shot videos by maintaining a dynamically updated visual memory and adapting pre-trained single-shot models with LoRA fine-tuning.
Contribution
The paper presents a novel Memory-to-Video (M2V) design that uses explicit visual memory to turn single-shot video diffusion models into multi-shot storytellers, and it introduces a new benchmark for evaluation.
Findings
Achieves superior cross-shot consistency compared to previous methods.
Maintains high aesthetic quality and adherence to prompts.
Supports smooth shot transitions and customized storytelling.
Abstract
Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally…
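The iterative, memory-conditioned loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The names `Keyframe`, `MemoryBank`, and `generate_story` are hypothetical, as are the thresholds, the drop-oldest eviction rule, and the use of cosine similarity as a stand-in for semantic keyframe selection. The negative indices returned by `positions` are one plausible reading of "negative RoPE shifts" (memory frames placed before time 0 of the new shot). The diffusion model itself is abstracted as a caller-supplied `generate_shot` function.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Keyframe:
    shot_index: int
    embedding: List[float]  # stand-in for a semantic feature of the frame
    aesthetic: float        # stand-in for an aesthetic preference score


@dataclass
class MemoryBank:
    # All three values are illustrative assumptions, not taken from the paper.
    capacity: int = 4
    min_aesthetic: float = 0.5
    max_similarity: float = 0.9
    frames: List[Keyframe] = field(default_factory=list)

    def update(self, candidates: List[Keyframe]) -> None:
        """Add candidates that pass the aesthetic filter and are not redundant."""
        for cand in candidates:
            if cand.aesthetic < self.min_aesthetic:
                continue  # aesthetic preference filtering
            if any(cosine(cand.embedding, kept.embedding) > self.max_similarity
                   for kept in self.frames):
                continue  # too similar to a stored keyframe, adds no information
            self.frames.append(cand)
        # Assumed eviction rule: keep only the most recent `capacity` keyframes.
        self.frames = self.frames[-self.capacity:]

    def positions(self) -> List[int]:
        """Negative temporal indices, so memory sits before frame 0 of the new shot."""
        return list(range(-len(self.frames), 0))


def generate_story(
    prompts: List[str],
    generate_shot: Callable[[str, List[Keyframe], List[int], int], List[Keyframe]],
    bank: MemoryBank,
) -> List[List[Keyframe]]:
    """Generate shots one at a time, each conditioned on the current memory."""
    shots = []
    for i, prompt in enumerate(prompts):
        candidates = generate_shot(prompt, list(bank.frames), bank.positions(), i)
        shots.append(candidates)
        bank.update(candidates)  # memory is updated after every shot
    return shots
```

In a real system, `generate_shot` would wrap the LoRA-tuned video diffusion model and concatenate the memory latents with the latents of the shot being denoised. Here it is left as a callback so the control flow can be read on its own.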
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
