GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis
Ming Tao, Bing-Kun Bao, Hao Tang, Changsheng Xu

TL;DR
GALIP introduces a novel generative adversarial framework leveraging CLIP for efficient, controllable, and high-quality text-to-image synthesis, significantly reducing data and parameter requirements while increasing speed.
Contribution
The paper presents GALIP, a CLIP-based GAN model that improves efficiency, control, and speed in text-to-image synthesis compared to existing large models.
Findings
Requires only 3% of the training data used by large pretrained autoregressive and diffusion models
Synthesizes images 120 times faster than those large models
Maintains high image quality and controllability
Abstract
Synthesizing high-fidelity complex images from text is challenging. Based on large pretraining, the autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, there remain three flaws. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows the image synthesis process heavily. 3) The synthesized visual features are difficult to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model both in the discriminator and generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Image Processing and 3D Reconstruction
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion · Contrastive Language-Image Pre-training
