Text-to-Image Generation Via Energy-Based CLIP
Roy Ganz, Michael Elad

TL;DR
CLIP-JEM introduces a multimodal energy-based model that generates realistic images from text, improves compositionality, enhances guidance in generative frameworks, and offers a robust evaluation metric for text-to-image tasks.
Contribution
The paper extends joint energy models to the multimodal domain using CLIP, integrating generative and discriminative objectives for improved text-to-image generation.
Findings
Generates realistic images from text descriptions.
Outperforms existing methods on compositionality benchmarks.
Serves as a robust evaluation metric for text-to-image models.
Abstract
Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative objective, we introduce an image-text joint-energy function based on cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative objective, we employ a contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance…
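The abstract describes the joint-energy function only as "based on cosine similarity in the CLIP space". As a rough illustration, the sketch below assumes the simplest such form, the negative cosine similarity between an image embedding and a text embedding, so that well-aligned pairs receive low energy. The function name `cosine_energy`, the `eps` parameter, and the random toy embeddings are hypothetical stand-ins; no CLIP encoder is used here, and the paper's exact definition may differ (for example in scaling or sign conventions).

```python
import numpy as np

def cosine_energy(image_emb, text_emb, eps=1e-8):
    """Assumed energy: negative cosine similarity of paired embeddings.

    Both inputs have shape (..., d). Lower energy means the image and
    text embeddings point in more similar directions.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    # Normalize each embedding to (approximately) unit length.
    img = image_emb / (np.linalg.norm(image_emb, axis=-1, keepdims=True) + eps)
    txt = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    # Dot product of unit vectors is the cosine similarity; negate it.
    return -np.sum(img * txt, axis=-1)

# Toy stand-ins for encoder outputs (hypothetical data, not real CLIP features).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))
matched_images = text + 0.1 * rng.normal(size=(4, 16))   # near their captions
mismatched_images = np.roll(matched_images, 1, axis=0)   # paired with wrong captions

e_matched = cosine_energy(matched_images, text)
e_mismatched = cosine_energy(mismatched_images, text)
```

Under this assumed form, the matched pairs sit close to the minimum energy of -1, while the shuffled pairs land at noticeably higher energy. This is the ordering the generative objective is described as encouraging during training.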
Taxonomy
Topics: Video Analysis and Summarization
Methods: Diffusion · Contrastive Language-Image Pre-training
