Cross-Modal Contrastive Learning for Text-to-Image Generation
Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, Yinfei Yang

TL;DR
This paper introduces XMC-GAN, a cross-modal contrastive generative adversarial network that improves the image quality and semantic fidelity of text-to-image synthesis by maximizing mutual information between images and text, lowering the state-of-the-art FID on MS-COCO from 24.70 to 9.33.
Contribution
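XMC-GAN combines multiple contrastive losses, an attentional self-modulation generator, and a contrastive discriminator to strengthen text-image correspondence, and reports state-of-the-art results on three datasets: MS-COCO, Localized Narratives, and Open Images.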
Findings
Improves FID from 24.70 to 9.33 on MS-COCO
Achieves higher human preference scores than previous models for both image quality and image-text alignment
Sets new state-of-the-art FID scores on Localized Narratives and Open Images datasets
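For readers unfamiliar with the metric, FID (Fréchet Inception Distance) is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images, so a lower value is better. The symbols below are the usual ones and are not taken from this page: \mu_r, \Sigma_r are the mean and covariance of real-image features, and \mu_g, \Sigma_g those of generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```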
Abstract
The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but--more importantly--people prefer XMC-GAN by 77.3 for…
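The idea of a contrastive loss between images and text can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an InfoNCE-style loss (a common contrastive objective whose minimization maximizes a lower bound on mutual information), a cosine-similarity score, and a temperature of 0.1. The function names and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(image_embs, text_embs):
    """Return sims[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def info_nce_loss(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE loss (illustrative sketch, assumed form).

    The i-th image and the i-th text are treated as the positive pair;
    every other text in the batch serves as a negative.
    """
    sims = cosine_similarity_matrix(image_embs, text_embs)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching text.
        total += log_denominator - logits[i]
    return total / len(sims)


# Toy 2-D embeddings: matched pairs give a much lower loss than mismatched ones.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
mismatched_texts = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce_loss(images, matched_texts) < info_nce_loss(images, mismatched_texts))
```

In this sketch the loss falls as each image embedding moves closer to its own caption and away from the other captions in the batch. The same form could be applied between two images (an intra-modality pair) by passing image embeddings for both arguments.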
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Vision and Imaging
