CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
Zihao Wang, Wei Liu, Qian He, Xinglong Wu, Zili Yi

TL;DR
CLIP-GEN introduces a self-supervised approach to train a high-quality text-to-image generator using only unlabeled images and a pre-trained CLIP model, eliminating the need for paired text-image data.
Contribution
The paper presents a novel language-free training scheme for text-to-image generation leveraging CLIP embeddings and VQGAN, enabling effective training with unlabeled image datasets.
Findings
Outperforms optimization-based methods in image quality.
Achieves comparable results to supervised models like CogView.
Maintains strong text-image matching without paired data.
Abstract
Training a text-to-image generator in the general domain (e.g., DALL-E, CogView) requires huge amounts of paired text-image data, which is too expensive to collect. In this paper, we propose a self-supervised scheme named CLIP-GEN for general text-to-image generation with the language-image priors extracted with a pre-trained CLIP model. In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator. Specifically, given an image without text labels, we first extract the embedding of the image in the unified language-vision embedding space with the image encoder of CLIP. Next, we convert the image into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN model can be trained with the unlabeled image dataset in hand). Finally, we train an autoregressive transformer that maps the image tokens from its unified…
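The three steps in the abstract (embed the image, tokenize it, train a conditional next-token model) can be illustrated with a minimal, dependency-free sketch of how one training example might be assembled. This is not the authors' implementation: `toy_image_embedding` is a hypothetical stand-in for CLIP's image encoder, `quantize` is a plain nearest-neighbour codebook lookup standing in for a VQGAN tokenizer, and the patch vectors and codebook are made-up toy values.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length, as is commonly done with CLIP embeddings."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def toy_image_embedding(patches):
    """Hypothetical stand-in for CLIP's image encoder: mean-pool patches, then normalize."""
    dim = len(patches[0])
    mean = [sum(p[i] for p in patches) / len(patches) for i in range(dim)]
    return l2_normalize(mean)


def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook entry (squared Euclidean distance)."""
    tokens = []
    for patch in patches:
        distances = [sum((a - b) ** 2 for a, b in zip(patch, code)) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def next_token_examples(condition, tokens):
    """Build (condition, previous tokens, target token) triples for autoregressive training."""
    return [(condition, tokens[:i], tokens[i]) for i in range(len(tokens))]


# Toy data (assumed for illustration): an "image" of three 2-D patches and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.1, 0.8]]

condition = toy_image_embedding(patches)  # step 1: embedding of the unlabeled image
tokens = quantize(patches, codebook)      # step 2: discrete image tokens
examples = next_token_examples(condition, tokens)  # step 3: training targets

for cond, prefix, target in examples:
    print(f"prefix={prefix} -> target={target}")
```

No caption appears anywhere in this construction; the only supervision is the image itself. Because CLIP places text and images in a shared embedding space, a text embedding can plausibly take the place of `condition` at generation time, which is the property the language-free scheme relies on.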
Taxonomy
Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
