CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Zihao Wang; Wei Liu; Qian He; Xinglong Wu; Zili Yi

arXiv:2203.00386 · cs.CV · March 2, 2022 · 31 cites

TL;DR

CLIP-GEN introduces a self-supervised approach to train a high-quality text-to-image generator using only unlabeled images and a pre-trained CLIP model, eliminating the need for paired text-image data.

Contribution

The paper presents a novel language-free training scheme for text-to-image generation leveraging CLIP embeddings and VQGAN, enabling effective training with unlabeled image datasets.

Findings

1. Outperforms optimization-based methods in image quality.
2. Achieves comparable results to supervised models like CogView.
3. Maintains strong text-image matching without paired data.

Abstract

Training a text-to-image generator in the general domain (e.g., DALL-E, CogView) requires huge amounts of paired text-image data, which is too expensive to collect. In this paper, we propose a self-supervised scheme named CLIP-GEN for general text-to-image generation with the language-image priors extracted with a pre-trained CLIP model. In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator. Specifically, given an image without text labels, we first extract the embedding of the image in the unified language-vision embedding space with the image encoder of CLIP. Next, we convert the image into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN model can be trained with the unlabeled image dataset in hand). Finally, we train an autoregressive transformer that maps the image tokens from its unified…
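
The training steps described in the abstract can be sketched as a small data-flow example. This is a toy illustration in pure Python, not the authors' implementation: `toy_clip_image_embedding` and `toy_vqgan_tokens` are hypothetical stand-ins for a pre-trained CLIP image encoder and a trained VQGAN tokenizer, and `CODEBOOK` is an invented five-entry scalar codebook. It only shows how each unlabeled image yields an (embedding, token sequence) pair, and how that pair expands into next-token prediction examples for an autoregressive model, with no text labels involved.

```python
import math
import random
from typing import List, Tuple

Image = List[List[float]]  # toy grayscale image, pixel values in [0, 1]

# Assumption: an invented scalar codebook standing in for a learned VQGAN codebook.
CODEBOOK = [0.0, 0.25, 0.5, 0.75, 1.0]


def toy_clip_image_embedding(image: Image, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for CLIP's image encoder.

    Projects the flattened pixels with a fixed pseudo-random matrix and
    L2-normalizes the result. A real pipeline would call a pre-trained CLIP model.
    """
    flat = [p for row in image for p in row]
    rng = random.Random(0)  # fixed seed so the projection is deterministic
    vec = [sum(rng.uniform(-1.0, 1.0) * p for p in flat) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def toy_vqgan_tokens(image: Image, patch: int = 2) -> List[int]:
    """Hypothetical stand-in for a VQGAN tokenizer.

    Each patch is summarized by its mean and mapped to the index of the
    nearest codebook entry, giving a sequence of discrete tokens in raster order.
    """
    tokens = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            block = [image[i][j]
                     for i in range(r, r + patch)
                     for j in range(c, c + patch)]
            mean = sum(block) / len(block)
            tokens.append(min(range(len(CODEBOOK)),
                              key=lambda k: abs(CODEBOOK[k] - mean)))
    return tokens


def autoregressive_examples(
    embedding: List[float], tokens: List[int]
) -> List[Tuple[List[float], List[int], int]]:
    """Expand one image into (condition, token prefix, next token) examples."""
    return [(embedding, tokens[:i], tokens[i]) for i in range(len(tokens))]


# One unlabeled 4x4 image; no caption is needed anywhere in the pipeline.
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.2, 0.3],
    [0.5, 0.5, 0.2, 0.3],
]

embedding = toy_clip_image_embedding(image)
tokens = toy_vqgan_tokens(image)
examples = autoregressive_examples(embedding, tokens)

print(tokens)         # [0, 4, 2, 1]
print(len(examples))  # 4
```

Each example pairs the same conditioning embedding with a growing token prefix, which is the shape of data an autoregressive transformer would be trained on in this kind of setup. The model itself is deliberately left out of the sketch.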

Taxonomy

Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training