CLIPAG: Towards Generator-Free Text-to-Image Generation

Roy Ganz; Michael Elad

arXiv:2306.16805 · cs.CV · September 4, 2023 · 1 citation

TL;DR

This paper introduces CLIPAG, a CLIP model finetuned for adversarial robustness so that its input gradients are perceptually aligned. Plugging CLIPAG into existing pipelines improves vision-language generative tasks and enables text-to-image synthesis without a large generative model.

Contribution

The paper extends the study of perceptually aligned gradients from unimodal, vision-only classifiers to vision-language models and demonstrates their utility for generator-free text-to-image generation.

Findings

01. Robust CLIP models exhibit perceptually aligned gradients, unlike their vanilla counterparts.

02. CLIPAG improves performance in vision-language generative tasks.

03. CLIPAG enables text-to-image generation without large generative models.
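
To make the idea of generator-free synthesis concrete, here is a minimal sketch of optimizing pixels directly by gradient ascent on image-text similarity. It is an illustration under stated assumptions, not the authors' pipeline: the toy linear `encode` function stands in for a robust CLIP image encoder, the vector `t` stands in for a text embedding, and all names, step sizes, and dimensions are hypothetical. The summary above does not give the paper's actual optimization details (augmentations, regularizers, step counts), so none are assumed here.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def encode(W, x):
    """Toy linear 'image encoder' (an assumption): one embedding coordinate per row of W."""
    return [dot(row, x) for row in W]


def cosine(u, t):
    return dot(u, t) / (math.sqrt(dot(u, u)) * math.sqrt(dot(t, t)))


def cosine_grad_wrt_pixels(W, x, t):
    """Gradient of cos(encode(W, x), t) with respect to the pixels x."""
    u = encode(W, x)
    nu, nt = math.sqrt(dot(u, u)), math.sqrt(dot(t, t))
    # d cos / d u = t / (|u||t|) - (u.t) u / (|u|^3 |t|)
    g_u = [ti / (nu * nt) - dot(u, t) * ui / (nu ** 3 * nt) for ui, ti in zip(u, t)]
    # Chain rule through the linear encoder: d cos / d x = W^T g_u
    return [sum(W[i][j] * g_u[i] for i in range(len(W))) for j in range(len(x))]


def synthesize(W, t, n_pixels, steps=500, lr=0.02, seed=0):
    """Gradient ascent on the pixels themselves; no generator network is involved."""
    rng = random.Random(seed)
    x = [rng.uniform(0.4, 0.6) for _ in range(n_pixels)]  # start from near-gray noise
    for _ in range(steps):
        g = cosine_grad_wrt_pixels(W, x, t)
        # Ascent step, then clip each pixel back into the valid range [0, 1].
        x = [min(1.0, max(0.0, xi + lr * gi)) for xi, gi in zip(x, g)]
    return x


# Hypothetical stand-ins: a 16-pixel "image", a 4-dimensional embedding space.
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(4)]
t = [rng.gauss(0, 1) for _ in range(4)]

x0 = synthesize(W, t, 16, steps=0)   # the initial noise image
x1 = synthesize(W, t, 16)            # the optimized image
before = cosine(encode(W, x0), t)
after = cosine(encode(W, x1), t)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

With a non-robust encoder, this kind of pixel-space ascent typically produces adversarial-looking noise rather than meaningful images. The findings above suggest that perceptually aligned gradients are what make the same loop yield semantically meaningful updates.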

Abstract

Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and carry semantic meaning. While this phenomenon has gained significant research attention, it has been studied only in the context of unimodal vision-only architectures. In this work, we extend the study of PAG to Vision-Language architectures, which form the foundations for diverse image-text tasks and applications. Through an adversarial robustification finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit PAG in contrast to their vanilla counterparts. This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative…

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: ALIGN · Contrastive Language-Image Pre-training