Character-Centered Dialogue Generation from Scene-Level Prompts
Taewon Kang, Ming C. Lin

TL;DR
This paper introduces a modular, training-free pipeline for generating character-driven dialogue in scene-based video narratives, integrating visual and auditory grounding with narrative memory.
Contribution
It presents a novel, scalable framework that combines visual semantics, structured prompts, and a recursive memory to produce coherent, expressive, and character-consistent dialogue in video storytelling.
Findings
Generates fully voiced, multimodal video narratives from prompts.
Maintains character and emotional consistency across scenes.
Generalizes across diverse story settings without training.
Abstract
Recent advances in scene-based video generation enable coherent visual narratives from structured prompts, yet a key aspect of storytelling -- character-driven dialogue and speech -- remains underexplored. We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded dialogue, enriching scene-based storytelling with natural voice and character expression. Our method takes a pair of prompts per scene, defining the setting and character behavior. While a story generation model such as Text2Story produces the visual scene, we focus on generating expressive, character-consistent utterances grounded in both the prompts and a representative scene image. A pretrained vision-language encoder extracts high-level visual semantics, which are combined with structured prompts to guide a large language model for dialogue synthesis. To maintain contextual and…
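The prompt-assembly step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Every name here (`ScenePrompts`, `NarrativeMemory`, `build_dialogue_prompt`) is hypothetical, the visual semantics are assumed to arrive as a list of short text cues already extracted by some vision-language encoder, and the memory is a simple capped list standing in for the paper's memory mechanism, which this excerpt does not describe. The language model call itself is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ScenePrompts:
    """The pair of prompts per scene: the setting and the character behavior."""
    setting: str
    behavior: str


@dataclass
class NarrativeMemory:
    """Hypothetical stand-in for narrative memory: keeps the last few utterances."""
    max_items: int = 3
    lines: list[str] = field(default_factory=list)

    def update(self, utterance: str) -> None:
        # Append the newest utterance and drop the oldest beyond the cap.
        self.lines.append(utterance)
        self.lines = self.lines[-self.max_items:]


def build_dialogue_prompt(
    scene: ScenePrompts,
    visual_cues: list[str],
    memory: NarrativeMemory,
    character: str,
) -> str:
    """Combine scene prompts, visual cues, and memory into one LLM prompt string."""
    earlier = "\n".join(f"- {line}" for line in memory.lines) or "(none)"
    return (
        f"Setting: {scene.setting}\n"
        f"Behavior: {scene.behavior}\n"
        f"Visual cues: {', '.join(visual_cues)}\n"
        f"Earlier lines:\n{earlier}\n"
        f"Write the next line of dialogue for {character}."
    )


if __name__ == "__main__":
    memory = NarrativeMemory(max_items=2)
    memory.update("We should not be here.")
    scene = ScenePrompts("A rainy dock at night", "Mara hesitates at the gangway")
    print(build_dialogue_prompt(scene, ["rain", "lantern"], memory, "Mara"))
```

In a full pipeline, the returned string would be passed to a language model and the generated utterance fed back through `memory.update` before the next scene, which is one simple way to carry context forward between scenes.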
