An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, and Hao Li

arXiv:2405.12914·cs.CV·July 19, 2024

Open Access · 1 Repo

TL;DR

This paper explores using Large Language Models (LLMs) as text encoders for text-to-image generation. A three-stage training pipeline with a lightweight adapter adds multilingual support and lets the model handle longer input prompts.

Contribution

It introduces a new training pipeline and adapter that integrate LLMs into text-to-image models, improving language understanding and generation quality.

Findings

01. Supports multilingual input

02. Handles longer text prompts effectively

03. Achieves superior image quality

Abstract

One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can only encode English, with a maximum token length of 77. Moreover, the model capacity of the CLIP text encoder is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve language understanding in text-to-image generation. Unfortunately, training a text-to-image generative model with LLMs from scratch demands significant computational resources and data. To this end, we introduce a three-stage training pipeline that effectively and efficiently integrates the existing…
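The 77-token limit mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's code: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for CLIP's real BPE tokenizer, and `encode_window` is an illustrative helper name. It only shows how a fixed context window silently discards the tail of a long prompt.

```python
# Context length of the CLIP text encoder, as stated in the abstract.
CLIP_MAX_TOKENS = 77


def toy_tokenize(prompt):
    """Hypothetical stand-in tokenizer: splits on whitespace.

    Real CLIP uses a BPE tokenizer, so actual token counts will differ.
    """
    return prompt.split()


def encode_window(prompt, max_tokens):
    """Return (kept, dropped) tokens for an encoder with a fixed window."""
    tokens = toy_tokenize(prompt)
    kept = tokens[:max_tokens]      # what the encoder actually sees
    dropped = tokens[max_tokens:]   # prompt content that is lost
    return kept, dropped


# A 120-word prompt: everything past the window never reaches the encoder.
long_prompt = " ".join(f"word{i}" for i in range(120))
kept, dropped = encode_window(long_prompt, CLIP_MAX_TOKENS)
print(len(kept), len(dropped))

# A short prompt fits entirely and loses nothing.
short_kept, short_dropped = encode_window("a red bicycle", CLIP_MAX_TOKENS)
print(len(short_kept), len(short_dropped))
```

An encoder with a longer context, such as the LLM-based encoders the paper investigates, would correspond to a larger `max_tokens` here, so fewer tokens of a detailed prompt would be dropped.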

Code & Models

Repositories

llm-conditioned-diffusion/llm-conditioned-diffusion.github.io
none · Official

Taxonomy

Topics: Diverse Approaches in Healthcare and Education Studies · Computational and Text Analysis Methods · Technology and Data Analysis

Methods: Adapter · Contrastive Language-Image Pre-training