FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization
Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, Qiang Liu

TL;DR
FuseDream enhances zero-shot text-to-image generation by optimizing in GAN latent space with robust CLIP scoring, novel initialization, and image composition techniques, achieving high-quality results without training.
Contribution
It introduces FuseDream, a training-free pipeline that improves CLIP+GAN image generation through advanced optimization, augmentation, and composition strategies.
Findings
Achieves top-level Inception Score and FID on MS COCO without additional architecture design or training.
Generates diverse, high-quality images from text prompts.
Extends GAN capabilities with novel composition and optimization methods.
Abstract
Generating images from natural language instructions is an intriguing yet highly challenging task. We approach text-to-image generation by combining the power of the pretrained CLIP representation with an off-the-shelf image generator (GANs), optimizing in the latent space of the GAN to find images that achieve maximum CLIP score with the given input text. Compared to traditional methods that train generative models from text to image starting from scratch, the CLIP+GAN approach is training-free, zero-shot, and can be easily customized with different generators. However, optimizing CLIP score in the GAN space casts a highly challenging optimization problem, and off-the-shelf optimizers such as Adam fail to yield satisfying results. In this work, we propose a FuseDream pipeline, which improves the CLIP+GAN approach with three key techniques: 1) an AugCLIP score which robustifies the CLIP…
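To make the basic CLIP+GAN idea in the abstract concrete, here is a minimal toy sketch of latent-space optimization: search for a latent vector whose generated "image" scores highest against a fixed "text" embedding. Everything in it is an assumption for illustration. The generator and encoder are small random maps standing in for a real GAN and for CLIP, the names (`generator`, `score`, `optimize_latent`) are hypothetical, and the optimizer is plain gradient ascent with step halving, not FuseDream's improved procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_DIM, EMBED_DIM = 8, 32, 16

# Hypothetical stand-ins: fixed random maps, NOT a real GAN or CLIP model.
W_gen = rng.normal(size=(IMAGE_DIM, LATENT_DIM))   # toy "generator" weights
W_enc = rng.normal(size=(EMBED_DIM, IMAGE_DIM))    # toy "image encoder" weights
text_embedding = rng.normal(size=EMBED_DIM)        # toy "text" embedding


def generator(z):
    """Map a latent vector to a toy 'image' vector."""
    return np.tanh(W_gen @ z)


def score(z):
    """Cosine similarity between the encoded toy image and the text embedding."""
    e = W_enc @ generator(z)
    denom = np.linalg.norm(e) * np.linalg.norm(text_embedding) + 1e-12
    return float(e @ text_embedding / denom)


def numerical_grad(f, z, eps=1e-5):
    """Central-difference gradient; adequate for a tiny latent dimension."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (f(z + step) - f(z - step)) / (2 * eps)
    return grad


def optimize_latent(z0, steps=100, lr=0.5):
    """Gradient ascent on the score, halving the step until it improves."""
    z = z0.copy()
    best = score(z)
    for _ in range(steps):
        g = numerical_grad(score, z)
        step = lr
        while step > 1e-6:
            candidate = z + step * g
            s = score(candidate)
            if s > best:
                z, best = candidate, s
                break
            step *= 0.5
    return z, best


z0 = rng.normal(size=LATENT_DIM)
initial_score = score(z0)
z_opt, final_score = optimize_latent(z0)
print(f"score before: {initial_score:.3f}, after: {final_score:.3f}")
```

In a real setting the toy maps would be replaced by a pretrained generator and the CLIP image and text encoders, and gradients would come from automatic differentiation rather than finite differences. The abstract's point is that this naive loop is hard to optimize well in practice, which is what motivates the paper's additional techniques.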
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
