DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
Jianzong Wu, Chao Tang, Jingbo Wang, Yanhong Zeng, Xiangtai Li, Yunhai Tong

TL;DR
DiffSensei is a novel framework that combines diffusion models and multimodal large language models to enable highly customizable manga generation with dynamic multi-character control from textual descriptions.
Contribution
The paper introduces DiffSensei, a new method integrating diffusion models with MLLMs for precise, flexible manga creation, and provides MangaZero, a large-scale dataset for this task.
Findings
DiffSensei outperforms existing models in manga generation quality.
It enables flexible character customization based on text cues.
The approach effectively manages multi-character interactions and expressions.
Abstract
Story visualization, the task of creating visual narratives from textual descriptions, has seen progress with text-to-image generation models. However, these models often lack effective control over character appearances and interactions, particularly in multi-character scenes. To address these limitations, we propose a new task, customized manga generation, and introduce DiffSensei, an innovative framework specifically designed for generating manga with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multimodal large language model (MLLM) that acts as a text-compatible identity adapter. Our approach employs masked cross-attention to seamlessly incorporate character features, enabling precise layout control without direct pixel transfer. Additionally, the MLLM-based adapter adjusts character features to align with…
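The masked cross-attention idea mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is one plausible reading in which each image token may attend only to the feature tokens of the character whose bounding box covers it. The function names `box_masks` and `masked_cross_attention`, the box format `(x0, y0, x1, y1)`, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np

def box_masks(boxes, height, width):
    """Hypothetical helper: one boolean mask per character box.

    boxes: list of (x0, y0, x1, y1) in grid coordinates, end-exclusive.
    Returns an array of shape (num_chars, height * width).
    """
    masks = np.zeros((len(boxes), height, width), dtype=bool)
    for k, (x0, y0, x1, y1) in enumerate(boxes):
        masks[k, y0:y1, x0:x1] = True
    return masks.reshape(len(boxes), -1)

def masked_cross_attention(queries, char_tokens, masks):
    """Attend from image tokens to character tokens, restricted by box masks.

    queries:     (N, d) image tokens, N = height * width
    char_tokens: (K, T, d) T feature tokens for each of K characters
    masks:       (K, N) True where image token n lies in character k's box
    Returns (N, d); tokens outside every box receive a zero vector.

    Assumption: keys double as values and no learned projections are
    applied, to keep the sketch short.
    """
    n, d = queries.shape
    k, t, _ = char_tokens.shape
    keys = char_tokens.reshape(k * t, d)

    scores = queries @ keys.T / np.sqrt(d)      # (N, K*T)
    allowed = np.repeat(masks.T, t, axis=1)     # (N, K*T), same key order
    scores = np.where(allowed, scores, -np.inf)

    # Rows with no allowed key would be all -inf; zero them before softmax.
    has_any = allowed.any(axis=1, keepdims=True)
    safe = np.where(has_any, scores, 0.0)
    safe = safe - safe.max(axis=1, keepdims=True)
    weights = np.exp(safe) * allowed
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights),
                        where=denom > 0)
    return weights @ keys
```

Under these assumptions, changing one character's feature tokens affects only the image tokens inside that character's box, which is the sense in which masking gives layout control without copying pixels from a reference image.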
Taxonomy
Topics: Digital Humanities and Scholarship · Handwritten Text Recognition Techniques
Methods: ALIGN · Adapter
