A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation
Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji

TL;DR
This study evaluates modern decoder-only large language models as text encoders in text-to-image diffusion, revealing that multi-layer normalized embeddings outperform traditional last-layer embeddings, improving image generation quality.
Contribution
It introduces a standardized pipeline for assessing LLM-based text encodings in text-to-image models and demonstrates the superiority of multi-layer normalized embeddings over last-layer embeddings.
Findings
Multi-layer normalized embeddings improve prompt alignment.
LLMs used with this approach outperform the T5 baseline.
Generated images show enhanced reasoning skills.
Abstract
Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the effectiveness of using modern decoder-only LLMs as text encoders for text-to-image diffusion models. We build a standardized training and evaluation pipeline that allows us to isolate and evaluate the effect of different text embeddings. We train a total of 27 text-to-image models with 12 different text encoders to analyze the critical aspects of LLMs that could impact text-to-image generation, including the approaches to extract embeddings, different LLM variants, and model sizes. Our experiments reveal that the de facto way of using last-layer embeddings as conditioning leads to inferior performance. Instead, we explore embeddings from various layers…
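The abstract excerpt is cut off before it describes how embeddings from several layers are combined, so the following is only a minimal sketch of one plausible reading of "multi-layer normalized embeddings": normalize each layer's hidden states per token, then average across layers. The function names (`layer_normalize`, `multi_layer_embedding`), the choice of a plain average, and the toy random "hidden states" are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def layer_normalize(h, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance.

    h: array of shape (num_tokens, dim). No learned scale or shift is applied.
    """
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def multi_layer_embedding(hidden_states, eps=1e-6):
    """Average per-layer normalized hidden states (hypothetical helper).

    hidden_states: sequence of arrays, one per layer, each (num_tokens, dim).
    Returns an array of shape (num_tokens, dim).
    """
    normalized = [layer_normalize(h, eps) for h in hidden_states]
    stacked = np.stack(normalized)  # (num_layers, num_tokens, dim)
    return stacked.mean(axis=0)


# Toy stand-in for an LLM's per-layer outputs: activation scale grows with depth,
# so an unnormalized average would be dominated by the deepest layers.
rng = np.random.default_rng(0)
num_layers, num_tokens, dim = 4, 5, 8
hidden_states = [
    rng.normal(scale=10.0 ** i, size=(num_tokens, dim)) for i in range(num_layers)
]

last_layer = hidden_states[-1]                   # conventional last-layer conditioning
combined = multi_layer_embedding(hidden_states)  # multi-layer normalized conditioning
print(last_layer.shape, combined.shape)
```

Normalizing before averaging keeps every layer's contribution on a comparable scale, which is the motivation assumed here for pairing normalization with multi-layer aggregation. In a real pipeline the per-layer arrays would come from the text encoder rather than a random generator.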
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
Methods: Gated Linear Unit · Layer Normalization · Attention Dropout · Adafactor · Inverse Square Root Schedule · Softmax · Linear Layer · Dropout · Dense Connections
