TL;DR
This paper introduces CLIP-GLaSS, a zero-shot framework that generates images or captions from a given caption or image by searching latent space with a genetic algorithm guided by CLIP embeddings.
Contribution
It presents a novel zero-shot approach combining CLIP, generative models, and genetic algorithms for cross-modal image-caption generation.
Findings
Generates images from captions, and captions from images, in a zero-shot setting.
Uses BigGAN and StyleGAN2 as image generators and GPT-2 as the text generator.
Reports promising results in both directions of cross-modal generation.
Abstract
In this research work we present CLIP-GLaSS, a novel zero-shot framework to generate an image (or a caption) corresponding to a given caption (or image). CLIP-GLaSS is based on the CLIP neural network, which, given an image and a descriptive caption, provides similar embeddings. In contrast, CLIP-GLaSS takes a caption (or an image) as input and generates the image (or the caption) whose CLIP embedding is most similar to that of the input. This optimal image (or caption) is produced by a generative network, after a genetic algorithm has explored the network's latent space. Promising results are shown, based on experiments with the image generators BigGAN and StyleGAN2 and the text generator GPT2.
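The search loop described in the abstract can be illustrated with a toy sketch for the caption-to-image direction. Everything below is an assumption made for illustration: `make_toy_models` builds small random linear maps that stand in for the image generator and the CLIP image encoder (they are not BigGAN, StyleGAN2, or CLIP), the target vector stands in for the CLIP embedding of the input caption, and `genetic_search` is a plain elitist genetic algorithm, which is not necessarily the variant used in the paper.

```python
import math
import random

# Hypothetical toy sizes; real latents and CLIP embeddings are far larger.
LATENT_DIM, IMAGE_DIM, EMBED_DIM = 8, 16, 4


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_toy_models(seed=0):
    """Return (generate, embed): stand-ins for a generator and an image encoder."""
    rng = random.Random(seed)
    gen_weights = random_matrix(IMAGE_DIM, LATENT_DIM, rng)
    enc_weights = random_matrix(EMBED_DIM, IMAGE_DIM, rng)

    def generate(z):
        # Assumption: a fixed random map plays the role of the image generator.
        return [math.tanh(v) for v in matvec(gen_weights, z)]

    def embed(image):
        # Assumption: a fixed random map plays the role of the CLIP image encoder.
        return matvec(enc_weights, image)

    return generate, embed


def genetic_search(target, generate, embed, pop_size=40, generations=60,
                   elite=8, sigma=0.3, seed=1):
    """Evolve latent vectors so that embed(generate(z)) aligns with target."""
    rng = random.Random(seed)

    def fitness(z):
        return cosine(embed(generate(z)), target)

    population = [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:elite]  # elitism: the best latents survive unchanged
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            # Uniform crossover followed by Gaussian mutation.
            children.append([rng.choice(pair) + rng.gauss(0.0, sigma)
                             for pair in zip(a, b)])
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best), history


if __name__ == "__main__":
    generate, embed = make_toy_models()
    # Stand-in for the caption embedding: the embedding of a hidden toy image.
    target = embed(generate([0.5] * LATENT_DIM))
    best_z, best_score, history = genetic_search(target, generate, embed)
    print(f"similarity: first generation {history[0]:.3f}, final {best_score:.3f}")
```

Because the fittest latents are carried over unchanged, the best similarity never decreases from one generation to the next. The search needs only fitness evaluations, not gradients, so the generator and encoder are treated as black boxes. The image-to-caption direction would follow the same pattern, with a text generator and a text encoder in place of the image ones.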
Taxonomy
Methods: Softmax · Dense Connections · Adam · Linear Layer · 1x1 Convolution · Feedforward Network · Off-Diagonal Orthogonal Regularization · Projection Discriminator
