DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models

Sungnyun Kim; Junsoo Lee; Kibeom Hong; Daesik Kim; Namhyuk Ahn

arXiv:2305.15194 · cs.CV · August 27, 2025 · 1 citation

Open Access · 1 Repo · 1 Model

TL;DR

DiffBlender introduces a unified multimodal diffusion model that effectively combines structure, layout, and attribute inputs for versatile and high-quality text-to-image generation without extensive retraining.

Contribution

DiffBlender presents a framework that integrates multiple modalities into a single diffusion model while updating only a small subset of components, and the authors report that it sets new benchmarks in multimodal image synthesis.

Findings

01 Achieves state-of-the-art results in multimodal generation benchmarks.

02 Successfully integrates multiple modalities without modifying pre-trained diffusion models.

03 Demonstrates diverse applications in detailed image synthesis.

Abstract

In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are…
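The abstract's statement that the pre-trained diffusion model is left unmodified while only a small subset of components is updated can be illustrated with a toy training step. What follows is a minimal sketch, not the authors' implementation: the component names (pretrained_backbone, structure_module, layout_module, attribute_module), the weight values, and the plain gradient-descent update are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Component:
    """A named group of parameters; trainable=False marks it as frozen."""
    name: str
    weights: List[float]
    trainable: bool


def sgd_step(components: List[Component],
             grads: Dict[str, List[float]],
             lr: float = 0.1) -> None:
    """Apply one gradient-descent step, skipping every frozen component."""
    for c in components:
        if not c.trainable:
            continue  # frozen parameters are never touched
        c.weights = [w - lr * g for w, g in zip(c.weights, grads[c.name])]


def trainable_fraction(components: List[Component]) -> float:
    """Share of all parameters that belong to trainable components."""
    total = sum(len(c.weights) for c in components)
    trainable = sum(len(c.weights) for c in components if c.trainable)
    return trainable / total


# Hypothetical components: one frozen backbone plus one small trainable
# module per modality type named in the abstract.
components = [
    Component("pretrained_backbone", [0.5] * 8, trainable=False),
    Component("structure_module", [0.1, 0.2], trainable=True),
    Component("layout_module", [0.3], trainable=True),
    Component("attribute_module", [0.4], trainable=True),
]

# Gradients exist for every component, but only trainable ones are used.
grads = {c.name: [1.0] * len(c.weights) for c in components}
sgd_step(components, grads, lr=0.1)

print(components[0].weights)           # backbone weights are unchanged
print(trainable_fraction(components))  # share of parameters being updated
```

In a real framework the same idea is usually expressed by excluding the frozen parameters from the optimizer rather than by skipping them inside the update loop; the sketch keeps the check explicit so the frozen/trainable split is easy to see.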

Code & Models

Repositories

sungnyun/diffblender
PyTorch · Official

Models

🤗 sungnyun/diffblender
model · ♡ 1

Taxonomy

Topics

Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods

Diffusion