DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation

Zhenxing Zhang; Lambert Schomaker

arXiv:2111.09267·cs.CV·May 10, 2022

TL;DR

DiverGAN is a single-stage framework for diverse, high-quality, and semantically consistent text-to-image generation, utilizing novel attention modules, adaptive normalization, and a dual-residual structure to enhance diversity and stability.

Contribution

The paper introduces DiverGAN, a novel single-stage text-to-image model with word-level attention, adaptive normalization, and a dual-residual structure, improving diversity, quality, and training stability.

Findings

01. Outperforms existing models in diversity and quality metrics.

02. Achieves faster convergence and more vivid details.

03. Effectively maintains semantic consistency in generated images.

Abstract

In this paper, we present an efficient and effective single-stage framework (DiverGAN) to generate diverse, plausible and semantically consistent images according to a natural-language description. DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM), which model the importance of each word in the given sentence while allowing the network to assign larger weights to the significant channels and pixels semantically aligning with the salient words. After that, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture, further improving visual-semantic representation and helping stabilize the training. Also, a dual-residual structure is developed to preserve more original…
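
As a rough illustration of the normalization idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that CAdaILN blends instance-normalized and layer-normalized features with a per-channel ratio and then applies a scale and shift predicted from the sentence embedding by linear maps. The function name `cadailn`, the weight matrices `w_gamma` and `w_beta`, the ratio `rho`, and all shapes are hypothetical choices made for this example.

```python
import numpy as np


def cadailn(x, sent_emb, w_gamma, w_beta, rho, eps=1e-5):
    """Sketch of a conditional adaptive instance-layer normalization.

    x        : feature map, shape (N, C, H, W)
    sent_emb : sentence embedding, shape (N, D)
    w_gamma  : hypothetical linear weights for the scale, shape (D, C)
    w_beta   : hypothetical linear weights for the shift, shape (D, C)
    rho      : per-channel blend ratio in [0, 1], shape (C,)
    """
    # Instance normalization: statistics per sample and per channel.
    in_mean = x.mean(axis=(2, 3), keepdims=True)
    in_var = x.var(axis=(2, 3), keepdims=True)
    x_in = (x - in_mean) / np.sqrt(in_var + eps)

    # Layer normalization: statistics per sample over all channels and pixels.
    ln_mean = x.mean(axis=(1, 2, 3), keepdims=True)
    ln_var = x.var(axis=(1, 2, 3), keepdims=True)
    x_ln = (x - ln_mean) / np.sqrt(ln_var + eps)

    # Blend the two normalized maps; rho = 1 is pure instance norm,
    # rho = 0 is pure layer norm (assumed blending rule).
    rho = np.clip(rho, 0.0, 1.0).reshape(1, -1, 1, 1)
    mixed = rho * x_in + (1.0 - rho) * x_ln

    # Scale and shift are conditioned on the sentence embedding.
    gamma = (sent_emb @ w_gamma)[:, :, None, None]
    beta = (sent_emb @ w_beta)[:, :, None, None]
    return gamma * mixed + beta


# Example call with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 4, 8, 8))
sentence = rng.normal(size=(2, 16))
out = cadailn(
    features,
    sentence,
    w_gamma=rng.normal(size=(16, 4)),
    w_beta=rng.normal(size=(16, 4)),
    rho=np.full(4, 0.5),
)
```

In this sketch the output has the same shape as the input feature map, and the sentence embedding only enters through the predicted scale and shift, which is one plausible reading of how linguistic cues could modulate shape and texture.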

Taxonomy

Methods: SPEED (Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings) · Convolution · Residual Connection · Linear Layer · Batch Normalization · Residual Block