Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, and Yu Liu

arXiv:2406.11831·cs.CV·December 6, 2024

TL;DR

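This paper proposes a framework for using decoder-only large language models as prompt encoders in diffusion-based text-to-image generation. It targets two obstacles that otherwise degrade prompt following: the mismatch between next-token-prediction training and the discriminative prompt features diffusion models need, and the positional bias of the decoder-only architecture.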

Contribution

The paper proposes a method for using LLMs as prompt encoders in diffusion models, including a design that eliminates positional bias and fuses multiple LLMs, which the authors report improves prompt understanding.

Findings

01

The authors report that LI-DiT, the diffusion transformer built on this framework, surpasses state-of-the-art models such as Stable Diffusion 3 and DALL-E 3.

02

The framework effectively fuses multiple LLMs for enhanced prompt understanding.

03

Experimental results validate the scalability and robustness of LI-DiT.

Abstract

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively…
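The positional bias named in the abstract can be illustrated with a toy causal mask. What follows is a minimal sketch under stated assumptions, not the paper's method: the function names are hypothetical, and uniform attention weights are assumed purely to keep the arithmetic visible. It shows that in a decoder-only model each token's feature depends only on itself and earlier tokens, so early prompt tokens carry no information about later ones.

```python
def causal_mask(n):
    """Return an n x n boolean mask; entry [i][j] is True when token i may attend to token j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_counts(mask):
    """Count how many tokens each position can attend to under the mask."""
    return [sum(row) for row in mask]


def uniform_causal_attention(values):
    """Toy attention: each output is the mean of the current and all earlier values.

    Uniform weights are an illustrative assumption; real models learn them.
    """
    outputs = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        outputs.append(running_sum / (i + 1))
    return outputs


if __name__ == "__main__":
    prompt = ["a", "red", "cube", "on", "a", "blue", "sphere"]
    counts = visible_counts(causal_mask(len(prompt)))
    for token, count in zip(prompt, counts):
        print(f"{token!r} sees {count} token(s)")

    # Changing the last token leaves every earlier feature untouched.
    before = uniform_causal_attention([1.0, 2.0, 3.0, 4.0])
    after = uniform_causal_attention([1.0, 2.0, 3.0, 40.0])
    print(before[:3] == after[:3])
```

In this sketch only the final position aggregates the whole prompt, which is one way to see why features taken directly from a decoder-only LLM are uneven across positions.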

Taxonomy

Topics: Topic Modeling

Methods: Residual Connection · Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Attention Is All You Need · Linear Layer