The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Chenyu Mu; Xin He; Qu Yang; Wanshun Chen; Jiadi Yao; Huang Liu; Zihao Yi; Bo Zhao; Xingyu Chen; Ruotian Ma; Fanghua Ye; Erkun Yang; Cheng Deng; Zhaopeng Tu; Xiaolong Li; Linus

arXiv:2601.17737 · cs.CV · January 28, 2026

TL;DR

This paper introduces an agentic framework that translates dialogue into cinematic scripts and orchestrates video generation to produce long, coherent videos, bridging the semantic gap in text-to-video synthesis.

Contribution

We propose ScripterAgent and DirectorAgent, along with ScriptBench and new evaluation metrics, to improve long-horizon, dialogue-driven cinematic video generation.

Findings

01. Enhanced script faithfulness and temporal coherence in generated videos

02. Introduction of ScriptBench, a large-scale multimodal benchmark

03. Identification of a trade-off between visual spectacle and script adherence

Abstract

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games