Customized Visual Storytelling with Unified Multimodal LLMs
Wei-Hua Li, Cheng Sun, Chu-Song Chen

TL;DR
This paper presents VstoryGen, a multimodal framework for customizable story generation that integrates textual descriptions with character and background references, including shot-type control for cinematic diversity.
Contribution
The paper introduces VstoryGen, a multimodal story generation model with shot-type control, and establishes two new benchmarks for evaluating story customization.
Findings
VstoryGen improves story consistency and cinematic diversity.
The framework effectively integrates multimodal cues for story customization.
Experiments show that VstoryGen outperforms existing methods.
Abstract
Multimodal story customization aims to generate coherent story flows conditioned on textual descriptions, reference identity images, and shot types. While recent progress in story generation has shown promising results, most approaches rely on text-only inputs. A few studies incorporate character identity cues (e.g., facial ID), but lack broader multimodal conditioning. In this work, we introduce VstoryGen, a multimodal framework that integrates descriptions with character and background references to enable customizable story generation. To enhance cinematic diversity, we introduce shot-type control via parameter-efficient prompt tuning on movie data, enabling the model to generate sequences that more faithfully reflect cinematic grammar. To evaluate our framework, we establish two new benchmarks that assess multimodal story customization from the perspectives of character and scene…
