DreamingComics: A Story Visualization Pipeline via Subject and Layout Customized Generation using Video Models
Patrick Kwon, Chen Chen

TL;DR
DreamingComics is a novel story visualization framework that uses a pretrained video model, region-aware positional encoding, and language-based layout generation to improve character consistency, style similarity, and spatial accuracy in comic-style image synthesis.
Contribution
The paper introduces RegionalRoPE, a region-aware positional encoding for layout control, and integrates an LLM-based layout generator, improving controllability and consistency in story visualization.
Findings
29.2% increase in character consistency
36.2% increase in style similarity
High spatial accuracy achieved
Abstract
Current story visualization methods tend to position subjects solely by text and face challenges in maintaining artistic consistency. To address these limitations, we introduce DreamingComics, a layout-aware story visualization framework. We build upon a pretrained video diffusion-transformer (DiT) model, leveraging its spatiotemporal priors to enhance identity and style consistency. For layout-based position control, we propose RegionalRoPE, a region-aware positional encoding scheme that re-indexes embeddings based on the target layout. Additionally, we introduce a masked condition loss to further constrain each subject's visual features to their designated region. To infer layouts from natural language scripts, we integrate an LLM-based layout generator trained to produce comic-style layouts, enabling flexible and controllable layout conditioning. We present a comprehensive evaluation…
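The abstract describes RegionalRoPE only as re-indexing positional embeddings according to the target layout. As a rough illustration of that idea, and not the authors' implementation, the sketch below assigns each token of a subject reference image a 2D position inside that subject's target bounding box, then computes standard rotary angles from the re-indexed positions. The function names, the inclusive/exclusive box convention, and the linear spacing are all assumptions made for this example.

```python
import numpy as np

def regional_position_ids(ref_h, ref_w, box):
    """Hypothetical helper: spread a ref_h x ref_w grid of reference tokens
    over a target box (y0, x0, y1, x1) given in latent-grid coordinates.
    y0/x0 are inclusive and y1/x1 exclusive (an assumed convention)."""
    y0, x0, y1, x1 = box
    ys = np.linspace(y0, y1 - 1, ref_h)  # row positions inside the box
    xs = np.linspace(x0, x1 - 1, ref_w)  # column positions inside the box
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    # One (row, col) pair per reference token, in row-major order.
    return np.stack([grid_y.ravel(), grid_x.ravel()], axis=-1)

def rope_angles(pos, dim, base=10000.0):
    """Standard rotary-embedding angles for 1D positions `pos`;
    returns an array of shape (len(pos), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos[:, None] * inv_freq[None, :]

# A 4x4 reference token grid placed in a box covering rows 8..15, cols 16..23.
ids = regional_position_ids(4, 4, (8, 16, 16, 24))
row_angles = rope_angles(ids[:, 0], dim=8)
col_angles = rope_angles(ids[:, 1], dim=8)
```

Under these assumptions, reference tokens share positional coordinates with the region they are meant to fill, which is one plausible way a layout could steer where attention places a subject. The paper's actual indexing scheme may differ.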
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics · Multimodal Machine Learning Applications
