Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
Chayan Jain, Rishant Sharma, Archit Garg, Ishan Bhanuka, Pratik Narang, Dhruv Kumar

TL;DR
This paper presents a multistage pipeline for generating long, character-consistent AI video stories by combining large language models, text-to-image models, and video synthesis, emphasizing visual anchoring for identity preservation.
Contribution
It introduces a filmmaker-like multistage approach that improves character consistency in AI-generated videos and analyzes cultural biases in current models.
Findings
Removing visual anchoring cuts the character consistency score from 7.99 to 0.55, a drop of roughly 93%.
The pipeline significantly enhances identity preservation in long video stories.
Cultural biases affect subject consistency and dynamic behavior in generated videos.
Abstract
Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current…
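The staged flow described in the abstract (script, then per-character anchor images, then per-scene video synthesis) can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: every name here (`Script`, `Scene`, `write_script`, `render_anchor`, `synthesize_scene`, `generate_story`) is hypothetical, and the three stage functions are deterministic placeholders standing in for real LLM, text-to-image, and video model calls.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    description: str
    characters: list[str]


@dataclass
class Script:
    characters: dict[str, str]  # character name -> appearance description
    scenes: list[Scene]


def write_script(premise: str) -> Script:
    """Stage 1 placeholder: a real pipeline would ask an LLM for this script."""
    return Script(
        characters={"Mira": "a young woman with a red scarf"},
        scenes=[
            Scene(f"{premise}: Mira leaves home", ["Mira"]),
            Scene(f"{premise}: Mira reaches the city", ["Mira"]),
        ],
    )


def render_anchor(name: str, appearance: str) -> str:
    """Stage 2 placeholder: a real pipeline would call a text-to-image model.

    Returns an identifier for the reference image of one character.
    """
    return f"anchor::{name}"


def synthesize_scene(scene: Scene, anchors: dict[str, str]) -> str:
    """Stage 3 placeholder: a real pipeline would call a video model,
    conditioning on the anchor images of the characters in this scene."""
    used = [anchors[name] for name in scene.characters]
    return f"clip({scene.description!r}, anchors={used})"


def generate_story(premise: str) -> list[str]:
    script = write_script(premise)
    # One anchor per character, created once and reused in every scene.
    anchors = {
        name: render_anchor(name, look)
        for name, look in script.characters.items()
    }
    # Each scene is synthesized individually against the shared anchors.
    return [synthesize_scene(scene, anchors) for scene in script.scenes]


clips = generate_story("A journey")
```

The point of the sketch is the data flow: anchors are produced once per character and passed unchanged into every scene, which is the visual anchoring mechanism whose removal the abstract reports as collapsing the consistency score.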
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
