EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

Yucheng Han; Rui Wang; Chi Zhang; Juntao Hu; Pei Cheng; Bin Fu; Hanwang Zhang

arXiv:2406.09162·cs.CV·June 14, 2024

TL;DR

EMMA is a novel multi-modal image generation model built on a pre-trained text-to-image diffusion framework, effectively integrating multiple modalities through a special attention mechanism and enabling flexible, personalized, and context-aware image synthesis.

Contribution

The paper introduces EMMA, a flexible multi-modal image generation model that exploits a pre-trained T2I diffusion model's hidden capacity to accept multi-modal prompts. The base model is not retrained: all of its parameters stay frozen and only some additional layers are adjusted.

Findings

01. EMMA maintains high fidelity and detail in generated images.

02. The pre-trained T2I diffusion model can secretly accept multi-modal prompts.

03. EMMA can produce images conditioned on multiple modalities simultaneously.

Abstract

Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we reveal an…
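To make the idea of attention-based fusion concrete, here is a minimal sketch in plain Python. It is not the paper's Multi-modal Feature Connector, whose exact design the abstract does not spell out. It assumes, purely for illustration, that text tokens act as queries over the tokens of a second modality and that the attended result is added back through a scalar gate. The names `cross_attend`, `fuse`, and `gate` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys, values):
    """Scaled dot-product attention: each query row becomes a weighted mix of value rows."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    output = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        output.append(mixed)
    return output


def fuse(text_tokens, extra_tokens, gate=0.5):
    """Hypothetical fusion (an assumption, not EMMA's layer):
    text tokens attend to the extra modality, then the result is added back, scaled by `gate`."""
    attended = cross_attend(text_tokens, extra_tokens, extra_tokens)
    return [
        [t + gate * a for t, a in zip(text_row, attended_row)]
        for text_row, attended_row in zip(text_tokens, attended)
    ]


# Toy inputs: two text tokens and three reference-image tokens, each of width 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
fused = fuse(text_tokens, image_tokens)
```

In this sketch, setting `gate=0.0` returns the text tokens unchanged, so the downstream model would see exactly its original text conditioning. That is one simple way an added layer can sit in front of a frozen model without disturbing it, though whether EMMA uses a gate of this kind is not stated in the abstract.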

Code & Models

Repositories

tencentqqgylab/ella (PyTorch)

Taxonomy

Topics: Topic Modeling · Data Visualization and Analytics · Advanced Text Analysis Techniques

Methods: Diffusion