Text-to-Image Generation Via Energy-Based CLIP

Roy Ganz; Michael Elad

arXiv:2408.17046 · cs.CV · July 29, 2025

Open Access

TL;DR

CLIP-JEM introduces a multimodal energy-based model that generates realistic images from text, improves compositionality, enhances guidance in generative frameworks, and offers a robust evaluation metric for text-to-image tasks.

Contribution

The paper extends joint energy models to the multimodal domain using CLIP, integrating generative and discriminative objectives for improved text-to-image generation.

Findings

1. Generates realistic images from text descriptions.

2. Outperforms existing methods on compositionality benchmarks.

3. Serves as a robust evaluation metric for text-to-image models.

Abstract

Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach that extends JEMs to the multimodal vision-language domain using CLIP and integrates both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance…
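The cosine-similarity energy described in the abstract can be illustrated with a small sketch. This is only an assumption-laden illustration, not the paper's implementation: the sign convention (energy as negative cosine similarity), the absence of any temperature or scale factor, and the simple real-minus-generated energy gap are all assumed here, and the function names `joint_energy` and `energy_gap_loss` are hypothetical. Plain NumPy arrays stand in for the outputs of CLIP's image and text encoders.

```python
import numpy as np


def l2_normalize(v, eps=1e-12):
    """Scale each row vector to unit length (eps guards against zero vectors)."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norm, eps)


def joint_energy(image_emb, text_emb):
    """Assumed energy: negative cosine similarity between paired embeddings.

    Well-aligned image-caption pairs get low energy (close to -1),
    mismatched pairs get higher energy (up to +1).
    """
    return -np.sum(l2_normalize(image_emb) * l2_normalize(text_emb), axis=-1)


def energy_gap_loss(real_img_emb, fake_img_emb, text_emb):
    """Assumed generative objective: mean of E(real, text) - E(fake, text).

    Minimizing it pushes energy down on real pairs and up on generated ones.
    """
    real_energy = joint_energy(real_img_emb, text_emb)
    fake_energy = joint_energy(fake_img_emb, text_emb)
    return float(np.mean(real_energy - fake_energy))


if __name__ == "__main__":
    # Stand-in embeddings; a real setup would use CLIP encoder outputs.
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))
    real = text + 0.1 * rng.normal(size=(4, 8))  # images that match their captions
    fake = rng.normal(size=(4, 8))               # unrelated images
    print("energy of real pairs:", joint_energy(real, text))
    print("energy of fake pairs:", joint_energy(fake, text))
    print("energy gap loss:", energy_gap_loss(real, fake, text))
```

In this sketch, sampling an image for a caption would amount to lowering `joint_energy` with respect to the image input; the abstract as shown here does not specify the sampler, so none is included.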


Taxonomy

Topics: Video Analysis and Summarization

Methods: Diffusion · Contrastive Language-Image Pre-training