A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

Mohamed Elmoghany; Ryan Rossi; Seunghyun Yoon; Subhojyoti Mukherjee; Eslam Bakr; Puneet Mathur; Gang Wu; Viet Dac Lai; Nedim Lipka; Ruiyi Zhang; Varun Manjunatha; Chien Nguyen; Daksh Dangi; Abel Salinas; Mohammad Taesiri; Hongjie Chen; Xiaolei Huang; Joe Barrow; Nesreen Ahmed; Hoda Eldardiry; Namyong Park; Yu Wang; Jaemin Cho; Anh Totti Nguyen; Zhengzhong Tu; Thien Nguyen; Dinesh Manocha; Mohamed Elhoseiny; Franck Dernoncourt

arXiv:2507.07202·cs.CV·July 11, 2025


Open Access

TL;DR

This survey reviews advances in long-video storytelling generation, highlighting architectural strategies, challenges in maintaining consistency, and quality issues in videos exceeding 16 seconds.

Contribution

It provides a taxonomy and comparative analysis of 32 papers on video generation, identifying the architectural components and training strategies that consistently yield multi-character, narratively coherent, high-fidelity long videos.

Findings

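01 Videos longer than 16 seconds struggle to keep character appearances and scene layouts consistent, especially with multiple subjects.

02 Methods that generate videos up to 150 seconds long often suffer from frame redundancy and low temporal diversity.

03 Recent work attempts long-form videos with multiple characters, narrative coherence, and high-fidelity detail.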

Abstract

Despite significant progress in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, which are often labeled "long-form videos". Furthermore, videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative. In particular, multi-subject long videos still fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We comprehensively studied 32 papers on video generation to identify key architectural components and training strategies that consistently yield these qualities. We also construct a…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Video Analysis and Summarization