HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives

Yihao Meng; Hao Ouyang; Yue Yu; Qiuyu Wang; Wen Wang; Ka Leong Cheng; Hanlin Wang; Yixuan Li; Cheng Chen; Yanhong Zeng; Yujun Shen; Huamin Qu

arXiv:2510.20822·cs.CV·October 24, 2025

TL;DR

HoloCine introduces a holistic model for generating long, coherent cinematic videos with global consistency, directorial control, and emergent cinematic understanding, advancing automated filmmaking.

Contribution

The paper presents a novel architecture that combines Window Cross-Attention and Sparse Inter-Shot Self-Attention to generate coherent, long-form video narratives, and reports that it surpasses previous clip-based models.
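To make the Window Cross-Attention idea concrete, the following is a minimal sketch of a mask that localizes each text prompt to its own shot. It is not the authors' implementation: the function name, the token layout, and the optional block of shared "global" text tokens are all assumptions made for illustration.

```python
def window_cross_attention_mask(shot_lengths, prompt_lengths, global_len=0):
    """Return mask[q][k]: may video token q attend to text token k?

    shot_lengths:   number of video tokens in each shot (hypothetical layout).
    prompt_lengths: number of text tokens in each per-shot prompt.
    global_len:     number of leading text tokens visible to every shot
                    (an assumption, e.g. an overall scene description).
    """
    if len(shot_lengths) != len(prompt_lengths):
        raise ValueError("need exactly one prompt per shot")

    total_text = global_len + sum(prompt_lengths)
    mask = []
    text_start = global_len  # where the current shot's prompt begins
    for n_video, n_text in zip(shot_lengths, prompt_lengths):
        row = [False] * total_text
        for k in range(global_len):
            row[k] = True  # shared tokens are visible to every shot
        for k in range(text_start, text_start + n_text):
            row[k] = True  # this shot's own prompt window
        # Every video token in the shot shares the same visibility row.
        mask.extend(list(row) for _ in range(n_video))
        text_start += n_text
    return mask


# Two shots (2 and 3 video tokens), prompts of 2 and 1 text tokens,
# plus 1 shared text token at the front.
mask = window_cross_attention_mask(
    shot_lengths=[2, 3], prompt_lengths=[2, 1], global_len=1
)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In this sketch, tokens from the first shot never see the second shot's prompt, which is one plausible way to read "localizes text prompts to specific shots".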

Findings

01. Achieves state-of-the-art narrative coherence in long videos

02. Develops persistent memory for characters and scenes

03. Demonstrates an emergent understanding of cinematic techniques

Abstract

State-of-the-art text-to-video models excel at generating isolated clips but fall short of creating the coherent, multi-shot narratives that are the essence of storytelling. We bridge this "narrative gap" with HoloCine, a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots, while a Sparse Inter-Shot Self-Attention pattern (dense within shots but sparse between them) ensures the efficiency required for minute-scale generation. Beyond setting a new state of the art in narrative coherence, HoloCine develops remarkable emergent abilities: a persistent memory for characters and scenes, and an intuitive grasp of cinematic techniques. Our work marks a pivotal shift from clip synthesis towards…
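The "dense within shots but sparse between them" pattern can be illustrated with a small mask-building sketch. The abstract does not say how the sparse inter-shot links are chosen, so the rule used here is an assumption: tokens attend to everything in their own shot, plus a few leading "summary" tokens of every other shot. The names `sparse_inter_shot_mask`, `attended_pairs`, and `summary_len` are hypothetical.

```python
def sparse_inter_shot_mask(shot_lengths, summary_len=1):
    """Return mask[q][k]: may video token q attend to video token k?

    Dense inside a shot; across shots, only the first `summary_len`
    tokens of each shot are visible (an assumed sparsity rule).
    """
    shot_of = []      # shot index of each token
    is_summary = []   # whether the token is one of its shot's summary tokens
    for shot_idx, n_tokens in enumerate(shot_lengths):
        for pos in range(n_tokens):
            shot_of.append(shot_idx)
            is_summary.append(pos < summary_len)

    total = len(shot_of)
    return [
        [shot_of[q] == shot_of[k] or is_summary[k] for k in range(total)]
        for q in range(total)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Three shots of four tokens each, one summary token per shot.
mask = sparse_inter_shot_mask([4, 4, 4], summary_len=1)
dense = len(mask) ** 2          # pairs under full self-attention
sparse = attended_pairs(mask)   # pairs under the sparse pattern
print(f"dense pairs: {dense}, sparse pairs: {sparse}")
```

Under this assumed rule, the toy example keeps 72 of the 144 possible pairs. The saving grows with the number of shots, because each added shot contributes only its summary tokens to the other shots' attention.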

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications