ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu

TL;DR
ELLA enhances diffusion models' understanding of complex, dense prompts by integrating large language models with a novel semantic connector, significantly improving prompt comprehension and image generation quality.
Contribution
This paper introduces ELLA, a method that incorporates LLMs into diffusion models for better semantic alignment without training either the U-Net or the LLM, and proposes DPG-Bench, a benchmark for evaluating dense prompt following.
Findings
ELLA outperforms existing methods in dense prompt following.
Improves handling of multiple objects and complex relationships.
Demonstrates effectiveness on the DPG-Bench benchmark.
Abstract
Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising…
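To make the idea of "timestep-dependent conditions" concrete, here is a toy sketch in plain Python. It is not the authors' TSC architecture: the function names (`timestep_embedding`, `toy_connector`), the additive way the timestep shifts the queries, and all sizes are assumptions chosen for illustration. It only shows the general pattern of a small set of query vectors attending over frozen LLM token features, with the attention pattern depending on the denoising timestep.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim is assumed to be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_connector(llm_tokens, queries, t):
    """Return one condition vector per query for timestep t.

    llm_tokens: token feature vectors standing in for frozen LLM outputs.
    queries:    query vectors standing in for the connector's trainable
                parameters (hypothetical; fixed random values in this demo).
    """
    dim = len(queries[0])
    t_emb = timestep_embedding(t, dim)
    conditions = []
    for q in queries:
        # Assumption: the timestep simply shifts the query before attention.
        q_t = [qi + ei for qi, ei in zip(q, t_emb)]
        scores = [dot(q_t, tok) / math.sqrt(dim) for tok in llm_tokens]
        weights = softmax(scores)
        # Each condition is an attention-weighted average of the token features.
        conditions.append(
            [sum(w * tok[d] for w, tok in zip(weights, llm_tokens)) for d in range(dim)]
        )
    return conditions


# Demo with random stand-in features: 5 "LLM tokens", 3 queries, width 8.
rng = random.Random(0)
DIM = 8
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
queries = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]

early = toy_connector(tokens, queries, t=900)  # noisy, early denoising step
late = toy_connector(tokens, queries, t=50)    # late, nearly clean step
```

The same prompt features yield different condition vectors at `t=900` and `t=50`, which is the property the abstract describes. In this sketch nothing about the tokens themselves changes, matching the statement that the LLM stays frozen.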
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Max Pooling · Diffusion · Convolution · Concatenated Skip Connection · Adapter · Contrastive Language-Image Pre-training · U-Net
