STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative
Peixuan Zhang, Zijian Jia, Kaiqi Liu, Shuchen Weng, Si Li, Boxin Shi

TL;DR
STAGE introduces a storyboard-anchored workflow for multi-shot cinematic video generation, improving narrative coherence and cinematic quality through novel structural prediction and memory mechanisms.
Contribution
The paper proposes a storyboard-based framework that pairs a multi-shot memory pack (for long-range entity consistency) with a dual-encoding strategy, and it contributes a large annotated dataset for cinematic multi-shot video synthesis.
Findings
Outperforms existing methods in narrative control and coherence
Achieves high-quality cinematic multi-shot video generation
Provides a new dataset with detailed annotations for story and cinematic attributes
Abstract
While recent advancements in generative models have achieved remarkable visual fidelity in video synthesis, creating coherent multi-shot narratives remains a significant challenge. To address this, keyframe-based approaches have emerged as a promising alternative to computationally intensive end-to-end methods, offering the advantages of fine-grained control and greater efficiency. However, these methods often fail to maintain cross-shot consistency and capture cinematic language. In this paper, we introduce STAGE, a SToryboard-Anchored GEneration workflow to reformulate the keyframe-based multi-shot video generation task. Instead of using sparse keyframes, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We introduce the multi-shot memory pack to ensure long-range entity consistency, the dual-encoding strategy for intra-shot…
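The storyboard described in the abstract, one start-end frame pair per shot rather than a set of sparse keyframes, can be pictured as a simple data structure. The following is a minimal sketch for illustration only, not the authors' implementation: the names `ShotAnchor`, `Storyboard`, `add_shot`, `anchor_frames`, and `shot_boundaries` are hypothetical, and frames are represented by placeholder strings rather than images.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ShotAnchor:
    """Start-end frame pair anchoring one shot (hypothetical structure)."""
    shot_index: int
    start_frame: str  # placeholder identifier, e.g. an image path (assumption)
    end_frame: str


@dataclass
class Storyboard:
    """Ordered collection of per-shot start-end frame pairs (hypothetical)."""
    shots: List[ShotAnchor] = field(default_factory=list)

    def add_shot(self, start_frame: str, end_frame: str) -> ShotAnchor:
        # Shots are indexed in the order they are added.
        anchor = ShotAnchor(len(self.shots), start_frame, end_frame)
        self.shots.append(anchor)
        return anchor

    def anchor_frames(self) -> List[str]:
        # Flatten the storyboard into an ordered frame list, two frames per shot.
        return [f for s in self.shots for f in (s.start_frame, s.end_frame)]

    def shot_boundaries(self) -> List[Tuple[str, str]]:
        # Pairs (end of shot i, start of shot i+1): the cuts between shots,
        # where cross-shot consistency would matter.
        return [(a.end_frame, b.start_frame)
                for a, b in zip(self.shots, self.shots[1:])]


board = Storyboard()
board.add_shot("shot0_start.png", "shot0_end.png")
board.add_shot("shot1_start.png", "shot1_end.png")
print(board.anchor_frames())
print(board.shot_boundaries())
```

Keeping both endpoints of every shot makes the cut between consecutive shots explicit: each boundary is an (end, start) pair that a downstream consistency check or video model could inspect.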
Taxonomy
Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
