LAFITE: Towards Language-Free Training for Text-to-Image Generation

Yufan Zhou; Ruiyi Zhang; Changyou Chen; Chunyuan Li; Chris Tensmeyer; Tong Yu; Jiuxiang Gu; Jinhui Xu; Tong Sun

arXiv:2111.13792 · cs.CV · March 25, 2022 · 37 cites

TL;DR

This paper introduces a novel approach for text-to-image generation that eliminates the need for paired text data by utilizing the semantic space of the CLIP model, achieving state-of-the-art results efficiently.

Contribution

It presents the first language-free training method for text-to-image models using CLIP's semantic space, reducing data and training costs while maintaining high quality.

Findings

01. Achieves state-of-the-art results on standard text-to-image generation tasks.

02. Outperforms most existing models trained with full image-text pairs.

03. Uses only about 1% of the training data and model size of the large DALL-E model.

Abstract

One of the major challenges in training text-to-image generation models is the need of a large number of high-quality image-text pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time- and cost-consuming. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model: the requirement of text-conditioning is seamlessly alleviated via generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with…
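The abstract does not spell out how text features are generated from image features. As a minimal sketch, assuming one plausible approach (perturbing a normalized image feature with a small amount of Gaussian noise so that it stays close to the original in the shared embedding space), the idea could look like the following. The function name `pseudo_text_features` and the parameter `noise_level` are hypothetical, and random vectors stand in for real CLIP image embeddings.

```python
import numpy as np


def pseudo_text_features(image_features, noise_level=0.1, rng=None):
    """Turn image embeddings into pseudo text embeddings (illustrative only).

    Assumption: a text feature for an image lies near that image's feature
    in the joint embedding space, so a small random perturbation of the
    image feature can stand in for the missing caption embedding.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Project image features onto the unit sphere.
    h = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)

    # Draw Gaussian noise and rescale it to unit length per sample,
    # so noise_level directly controls the perturbation magnitude.
    eps = rng.standard_normal(h.shape)
    eps /= np.linalg.norm(eps, axis=-1, keepdims=True)

    # Perturb, then renormalize back onto the unit sphere.
    perturbed = h + noise_level * eps
    return perturbed / np.linalg.norm(perturbed, axis=-1, keepdims=True)


# Stand-in for a batch of 4 image embeddings of dimension 512
# (random vectors, not real CLIP outputs).
rng = np.random.default_rng(0)
fake_image_features = rng.standard_normal((4, 512))
pseudo = pseudo_text_features(fake_image_features, noise_level=0.1, rng=rng)
```

With a small `noise_level`, each pseudo text feature keeps a high cosine similarity to its source image feature, which is the property a conditional generator would rely on when real captions are swapped in at test time.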

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis

Methods: Contrastive Language-Image Pre-training