Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories

Chayan Jain; Rishant Sharma; Archit Garg; Ishan Bhanuka; Pratik Narang; Dhruv Kumar

arXiv:2512.16954 · cs.CV · December 22, 2025

Open Access

TL;DR

This paper presents a multistage pipeline for generating long, character-consistent AI video stories by combining large language models, text-to-image models, and video synthesis, emphasizing visual anchoring for identity preservation.

Contribution

It introduces a filmmaker-like multistage approach that improves character consistency in AI-generated videos and analyzes cultural biases in current models.

Findings

01

Removing visual anchoring drops the character consistency score from 7.99 to 0.55.

02

The pipeline significantly enhances identity preservation in long video stories.

03

Cultural biases affect subject consistency and dynamic behavior in generated videos.

Abstract

Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current…
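
The three stages described in the abstract (script, character anchors, per-scene video) can be sketched as a small orchestration function. This is a minimal illustration only, not the authors' implementation: the names `Scene`, `Script`, `run_pipeline`, `write_script`, `render_anchor`, and `render_scene` are hypothetical, and the three model calls are passed in as plain callables because the abstract does not specify which models or APIs are used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scene:
    description: str          # what happens in the scene
    characters: List[str]     # names of the characters that appear


@dataclass
class Script:
    character_descriptions: Dict[str, str]  # name -> visual description
    scenes: List[Scene]


def run_pipeline(
    premise: str,
    write_script: Callable[[str], Script],                  # hypothetical LLM stage
    render_anchor: Callable[[str], bytes],                  # hypothetical text-to-image stage
    render_scene: Callable[[str, List[bytes]], bytes],      # hypothetical video stage
) -> List[bytes]:
    """Return one clip per scene, each conditioned on character anchor images."""
    # Stage 1: expand the premise into a production script.
    script = write_script(premise)

    # Stage 2: render one anchor image per character, once, so that every
    # scene reuses the same visual reference for that character.
    anchors = {
        name: render_anchor(description)
        for name, description in script.character_descriptions.items()
    }

    # Stage 3: synthesize each scene individually, passing only the anchors
    # of the characters that appear in it.
    clips = []
    for scene in script.scenes:
        scene_anchors = [anchors[name] for name in scene.characters]
        clips.append(render_scene(scene.description, scene_anchors))
    return clips
```

In this sketch, the no-anchor baseline mentioned in the abstract would correspond to calling `render_scene` with an empty anchor list for every scene; how the paper actually implements that ablation is not stated in the text above.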

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis