Cross-Modal Contrastive Learning for Text-to-Image Generation

Han Zhang; Jing Yu Koh; Jason Baldridge; Honglak Lee; Yinfei Yang

arXiv:2101.04702 · cs.CV · April 15, 2022 · 34 cites

TL;DR

This paper introduces XMC-GAN, a cross-modal contrastive generative adversarial network that significantly improves the quality and semantic fidelity of text-to-image synthesis by maximizing mutual information between images and text.

Contribution

XMC-GAN combines multiple contrastive losses with an attentional self-modulation generator to strengthen text-image correspondence, and reports new state-of-the-art FID scores on three datasets: MS-COCO, Localized Narratives, and Open Images.

Findings

01. Improves FID from 24.70 to 9.33 on MS-COCO

02. Achieves higher human preference scores for image quality and alignment

03. Sets new state-of-the-art FID scores on Localized Narratives and Open Images datasets

Abstract

The output of text-to-image synthesis systems should be coherent, clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. Our Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN) addresses this challenge by maximizing the mutual information between image and text. It does this via multiple contrastive losses which capture inter-modality and intra-modality correspondences. XMC-GAN uses an attentional self-modulation generator, which enforces strong text-image correspondence, and a contrastive discriminator, which acts as a critic as well as a feature encoder for contrastive learning. The quality of XMC-GAN's output is a major step up from previous models, as we show on three challenging datasets. On MS-COCO, not only does XMC-GAN improve state-of-the-art FID from 24.70 to 9.33, but--more importantly--people prefer XMC-GAN by 77.3 for…
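The abstract describes maximizing mutual information between image and text through contrastive losses. A minimal sketch of one such loss, assuming a generic InfoNCE-style objective over paired image and sentence embeddings, is given below. It is not taken from the paper or its official repository: the function names, the cosine-similarity scoring, and the temperature value of 0.1 are illustrative assumptions, and only the Python standard library is used so the example stays self-contained.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity_matrix(image_embs, text_embs):
    """Return sims[i][j] = cosine similarity between image i and text j."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts] for img in images]


def info_nce_loss(image_embs, text_embs, temperature=0.1):
    """Hypothetical image-to-text contrastive loss (InfoNCE-style).

    Row i of each input is assumed to be a matching image/text pair; every
    other text in the batch serves as a negative for image i. The temperature
    is an assumed hyperparameter, not a value from the paper.
    """
    sims = cosine_similarity_matrix(image_embs, text_embs)
    losses = []
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        # Negative log-probability of picking the matching text for image i.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", info_nce_loss(images, matched_texts))
    print("swapped pairs:", info_nce_loss(images, swapped_texts))
```

In this toy batch the loss is close to zero when each image sits next to its own caption in embedding space and grows large when the captions are swapped. Minimizing a loss of this form pulls matching pairs together and pushes mismatched pairs apart, which is the general mechanism by which contrastive objectives tighten a lower bound on mutual information.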

Code & Models

Repositories

google-research/xmcgan_image_generation (JAX, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Vision and Imaging