OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory

Zhaochong An; Menglin Jia; Haonan Qiu; Zijian Zhou; Xiaoke Huang; Zhiheng Liu; Weiming Ren; Kumara Kahatapitiya; Ding Liu; Sen He; Chenyang Zhang; Tao Xiang; Fanny Yang; Serge Belongie; Tian Xie

arXiv:2512.07802 · cs.CV · December 9, 2025

Open Access

TL;DR

OneStory generates multi-shot videos by modeling long-range cross-shot context with a compact global memory and adaptive conditioning, which improves narrative coherence in complex storytelling scenarios.

Contribution

The paper reformulates multi-shot video generation (MSV) as a next-shot generation task. This allows autoregressive shot synthesis on top of pretrained image-to-video (I2V) models, and the framework adds new modules for long-range context modeling.

Findings

01 Achieves state-of-the-art coherence in multi-shot videos

02 Effective in both text- and image-conditioned storytelling

03 Supports controllable, long-form video generation

Abstract

Storytelling in real-world videos often unfolds through multiple shots -- discontinuous yet semantically connected clips that together convey a coherent narrative. However, existing multi-shot video generation (MSV) methods struggle to effectively model long-range cross-shot context, as they rely on limited temporal windows or single keyframe conditioning, leading to degraded performance under complex narratives. In this work, we propose OneStory, enabling global yet compact cross-shot context modeling for consistent and scalable narrative generation. OneStory reformulates MSV as a next-shot generation task, enabling autoregressive shot synthesis while leveraging pretrained image-to-video (I2V) models for strong visual conditioning. We introduce two key modules: a Frame Selection module that constructs a semantically relevant global memory based on informative frames from prior shots,…
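
As a rough illustration of the next-shot formulation described in the abstract, the sketch below generates shots one at a time and conditions each new shot on a small set of frames selected from all prior shots. It is not the authors' implementation. The cosine-similarity top-k selection, the function names `select_frames`, `generate_story`, and `toy_generate_shot`, and the representation of captions and frames as small embedding vectors are all assumptions made for illustration. The paper's actual Frame Selection module and I2V backbone are not specified on this page.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Shot = List[Vector]  # a shot is modeled here as a list of frame embeddings


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_frames(memory: List[Vector], query: Vector, k: int) -> List[Vector]:
    """Assumed selection rule: keep the k memory frames most similar to the query."""
    ranked = sorted(memory, key=lambda frame: cosine(frame, query), reverse=True)
    return ranked[:k]


def toy_generate_shot(caption: Vector, context: List[Vector]) -> Shot:
    """Hypothetical stand-in for a pretrained I2V model: emits one frame
    equal to the caption embedding and ignores the context."""
    return [list(caption)]


def generate_story(
    captions: List[Vector],
    generate_shot: Callable[[Vector, List[Vector]], Shot],
    k: int = 2,
) -> List[Shot]:
    """Autoregressive loop: each shot sees a compact memory drawn from all prior shots."""
    memory: List[Vector] = []
    shots: List[Shot] = []
    for caption in captions:
        context = select_frames(memory, caption, k)  # compact global memory
        shot = generate_shot(caption, context)       # next-shot generation
        shots.append(shot)
        memory.extend(shot)                          # prior frames become candidates
    return shots


story = generate_story([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]], toy_generate_shot, k=1)
```

In this toy setup the memory grows with every shot, while the context passed to the generator stays bounded by `k`. That is one simple way to keep conditioning compact as the story gets longer.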

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization