DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models

Patrick Kwon; Chen Chen

arXiv:2512.01686·cs.CV·December 2, 2025

Open Access

TL;DR

DreamingComics is a novel story visualization framework that uses a pretrained video model, region-aware positional encoding, and language-based layout generation to improve character consistency, style similarity, and spatial accuracy in comic-style image synthesis.

Contribution

It introduces RegionalRoPE for layout control and integrates an LLM-based layout generator, advancing controllability and consistency in story visualization.

Findings

01. 29.2% increase in character consistency

02. 36.2% increase in style similarity

03. High spatial accuracy achieved

Abstract

Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation…
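The abstract describes RegionalRoPE only at a high level, as a positional encoding that re-indexes embeddings based on the target layout. The sketch below illustrates one plausible reading of that idea and is not the authors' implementation: tokens from a subject reference image are assigned rotary positions that fall inside the subject's target bounding box, so that they share positional indices with the region they should appear in. The function names (`region_position_ids`, `rotate_pairs`, `apply_rope_2d`), the `(top, left, bottom, right)` box convention, the linear spreading of positions, and the axial split of features into row and column halves are all assumptions made for illustration.

```python
import math


def region_position_ids(ref_h, ref_w, box):
    """Assign each token of a ref_h x ref_w reference grid a (row, col)
    position inside `box` on the target grid.

    `box` is (top, left, bottom, right) in target token coordinates,
    inclusive. This convention is hypothetical, not taken from the paper.
    """
    top, left, bottom, right = box

    def spread(lo, hi, n):
        # Evenly spread n positions over [lo, hi]; a single token sits at lo.
        if n == 1:
            return [float(lo)]
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

    rows = spread(top, bottom, ref_h)
    cols = spread(left, right, ref_w)
    return [(r, c) for r in rows for c in cols]


def rotate_pairs(vec, position, base=10000.0):
    """Apply standard 1D rotary encoding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        out.append(vec[i] * cos - vec[i + 1] * sin)
        out.append(vec[i] * sin + vec[i + 1] * cos)
    return out


def apply_rope_2d(token, pos):
    """Axial 2D RoPE (an assumed layout): the first half of the features
    encodes the row, the second half encodes the column.

    The feature length must be divisible by 4.
    """
    half = len(token) // 2
    row, col = pos
    return rotate_pairs(token[:half], row) + rotate_pairs(token[half:], col)


# A 2x2 reference grid re-indexed into a box spanning rows 4-5, cols 6-7.
ids = region_position_ids(2, 2, (4, 6, 5, 7))
encoded = [apply_rope_2d([1.0, 2.0, 3.0, 4.0], p) for p in ids]
```

Under these assumptions, a reference token and a target token at the same box location receive the same rotation, which is one way a layout could steer attention toward a region. The masked condition loss mentioned in the abstract is not covered by this sketch.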

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics · Multimodal Machine Learning Applications