InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Mohamed Elmoghany; Liangbing Zhao; Xiaoqian Shen; Subhojyoti Mukherjee; Yang Zhou; Gang Wu; Viet Dac Lai; Seunghyun Yoon; Ryan Rossi; Abdullah Rashwan; Puneet Mathur; Varun Manjunatha; Daksh Dangi; Chien Nguyen; Nedim Lipka; Trung Bui; Krishna Kumar Singh; Ruiyi Zhang; Xiaolei Huang; Jaemin Cho; Yu Wang; Namyong Park; Zhengzhong Tu; Hongjie Chen; Hoda Eldardiry; Nesreen Ahmed; Thien Nguyen; Dinesh Manocha; Mohamed Elhoseiny; and Franck Dernoncourt

arXiv:2603.03646·cs.CV·March 5, 2026

Open Access

TL;DR

InfinityStory introduces a framework for generating long-form storytelling videos with consistent backgrounds, seamless multi-subject shot transitions, and scalability to hour-long narratives, advancing video synthesis quality and coherence.

Contribution

The paper presents a novel background-consistent generation pipeline, a transition-aware synthesis module, and a synthetic dataset to improve long-form video storytelling with multiple subjects.

Findings

1. Achieves 88.94 background consistency on VBench.
2. Attains 82.11 subject consistency, outperforming prior methods.
3. Demonstrates improved stability and temporal coherence in generated videos.

Abstract

Generating long-form storytelling videos with consistent visual narratives remains a significant challenge in video synthesis. We present a novel framework, dataset, and model that address three critical limitations: background consistency across shots, seamless multi-subject shot-to-shot transitions, and scalability to hour-long narratives. Our approach introduces a background-consistent generation pipeline that maintains visual coherence across scenes while preserving character identity and spatial relationships. We further propose a transition-aware video synthesis module that generates smooth shot transitions for complex scenarios involving multiple subjects entering or exiting frames, going beyond the single-subject limitations of prior work. To support this, we contribute a synthetic dataset of 10,000 multi-subject transition sequences covering underrepresented dynamic…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications