STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative

Peixuan Zhang; Zijian Jia; Kaiqi Liu; Shuchen Weng; Si Li; Boxin Shi

arXiv:2512.12372 · cs.CV · March 17, 2026

Open Access

TL;DR

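STAGE introduces a storyboard-anchored workflow for multi-shot cinematic video generation. It predicts a structural storyboard of start-end frame pairs for each shot and uses a multi-shot memory pack to keep entities consistent across shots, with the aim of improving narrative coherence and cinematic quality.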

Contribution

The paper proposes a storyboard-based framework that pairs a multi-shot memory pack with a dual-encoding strategy, along with a large annotated dataset for cinematic multi-shot video synthesis.

Findings

01. Outperforms existing methods in narrative control and coherence

02. Achieves high-quality cinematic multi-shot video generation

03. Provides a new dataset with detailed annotations for story and cinematic attributes

Abstract

While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot…
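
To make the storyboard idea concrete, the following is a minimal sketch of how a storyboard of start-end frame pairs might be represented, in contrast to one sparse keyframe per shot. It is an illustration only, not the authors' implementation: the names `ShotAnchor`, `build_storyboard`, and `sparse_keyframes` are hypothetical, and frames are stood in for by plain strings such as file paths.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShotAnchor:
    """Hypothetical container: one shot anchored by a start and an end frame."""
    shot_id: int
    start_frame: str  # placeholder for an image, e.g. a file path
    end_frame: str    # placeholder for an image, e.g. a file path


def build_storyboard(frame_pairs: List[Tuple[str, str]]) -> List[ShotAnchor]:
    """Turn an ordered list of (start, end) frames into a storyboard.

    Assumption: shots are numbered in narrative order starting from 0.
    """
    return [ShotAnchor(i, start, end) for i, (start, end) in enumerate(frame_pairs)]


def sparse_keyframes(storyboard: List[ShotAnchor]) -> List[str]:
    """For contrast: a sparse-keyframe layout that keeps one frame per shot."""
    return [shot.start_frame for shot in storyboard]


if __name__ == "__main__":
    storyboard = build_storyboard([("shot0_start.png", "shot0_end.png"),
                                   ("shot1_start.png", "shot1_end.png")])
    for shot in storyboard:
        print(shot.shot_id, shot.start_frame, "->", shot.end_frame)
```

In this sketch each shot carries both of its boundary frames, so a downstream video model would have an explicit start and end to generate between, whereas the sparse layout gives it only a single frame per shot.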

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition