StoryMem: Multi-shot Long Video Storytelling with Memory

Kaiwen Zhang; Liming Jiang; Angtian Wang; Jacob Zhiyuan Fang; Tiancheng Zhi; Qing Yan; Hao Kang; Xin Lu; Xingang Pan

arXiv:2512.19539·cs.CV·December 23, 2025

TL;DR

StoryMem is a memory-augmented diffusion framework for long-form video storytelling. It generates coherent multi-shot videos by conditioning each new shot on a dynamically updated visual memory of keyframes, adapting a pre-trained single-shot video diffusion model with only LoRA fine-tuning.

Contribution

The paper presents a novel Memory-to-Video (M2V) design that turns single-shot video diffusion models into multi-shot storytellers using explicit visual memory, and it introduces a new benchmark for evaluation.

Findings

01 Achieves superior cross-shot consistency compared to previous methods.

02 Maintains high aesthetic quality and adherence to prompts.

03 Supports smooth shot transitions and customized storytelling.

Abstract

Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally…
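The memory mechanism described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the class and function names, the thresholds, the "keep the most recent entries" eviction rule, and the use of cosine similarity as the semantic test are all assumptions made for illustration. The position layout is likewise only one plausible reading of "negative RoPE shifts", in which memory frames receive negative temporal indices so that the current shot still starts at index 0.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    shot_index: int
    embedding: list   # hypothetical semantic feature vector for the frame
    aesthetic: float  # hypothetical aesthetic score in [0, 1]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class MemoryBank:
    capacity: int = 4            # assumed bound that keeps the memory compact
    min_aesthetic: float = 0.5   # assumed aesthetic filter threshold
    max_similarity: float = 0.9  # assumed redundancy threshold
    frames: list = field(default_factory=list)

    def update(self, candidates):
        """Add informative, good-looking keyframes from the latest shot."""
        for cand in candidates:
            if cand.aesthetic < self.min_aesthetic:
                continue  # aesthetic preference filtering
            if any(cosine(cand.embedding, f.embedding) > self.max_similarity
                   for f in self.frames):
                continue  # too similar to something already stored
            self.frames.append(cand)
        # Assumed eviction rule: keep only the most recent entries.
        self.frames = self.frames[-self.capacity:]


def temporal_positions(num_memory, num_shot_frames):
    """Temporal indices for [memory frames, current-shot frames].

    Memory frames get negative indices; the shot keeps indices 0..T-1.
    """
    return list(range(-num_memory, 0)) + list(range(num_shot_frames))


bank = MemoryBank(capacity=2)
bank.update([
    Keyframe(0, [1.0, 0.0], 0.8),
    Keyframe(0, [0.99, 0.05], 0.9),  # near-duplicate of the first frame
    Keyframe(0, [0.0, 1.0], 0.3),    # fails the aesthetic filter
])
positions = temporal_positions(len(bank.frames), 4)
```

In an iterative pipeline under these assumptions, `bank.update` would be called after each generated shot, and the stored keyframes would be concatenated with the next shot's latents using the indices from `temporal_positions`.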

Code & Models

Models

🤗 Kevin-thu/StoryMem (model · 51 downloads · 93 likes)


Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition