CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding; Zhuoyi Yang; Wenyi Hong; Wendi Zheng; Chang Zhou; Da Yin; Junyang Lin; Xu Zou; Zhou Shao; Hongxia Yang; Jie Tang

arXiv:2105.13290 · cs.CV · November 8, 2021 · 383 cites

TL;DR

CogView is a 4-billion-parameter Transformer that generates images from text using a VQ-VAE image tokenizer. It reports state-of-the-art FID on the blurred MS COCO dataset and can be fine-tuned for tasks such as style learning, super-resolution, text-image ranking, and fashion design.

Contribution

Introduces CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, together with fine-tuning strategies for downstream tasks and methods for stabilizing pretraining, such as eliminating NaN losses.

Findings

01 Achieves state-of-the-art FID on the blurred MS COCO dataset

02 Outperforms previous GAN-based models and DALL-E

03 Demonstrates fine-tuning for style learning, super-resolution, text-image ranking, and fashion design

Abstract

Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.
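To make the role of a VQ-VAE tokenizer concrete, the following is a minimal sketch of its lookup step: each continuous feature vector is replaced by the index of its nearest codebook entry, and the resulting discrete image tokens are placed after the text tokens so that a Transformer can model them as one sequence. The codebook, feature values, token ids, the `image_offset` parameter, and the function names are all hypothetical toy choices for illustration. They are not taken from CogView, and the sequence layout is an assumption rather than the paper's exact format.

```python
# Illustrative sketch only: nearest-codebook quantization, the lookup step of a
# VQ-VAE-style tokenizer. All values and names are toy assumptions, not CogView's.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(v, codebook[k]))
        for v in vectors
    ]


def build_sequence(text_tokens, image_tokens, image_offset):
    """Concatenate text and image tokens into a single sequence.

    image_offset shifts image ids past the text vocabulary so the two id
    ranges do not collide (an assumed layout, chosen for illustration).
    """
    return list(text_tokens) + [image_offset + t for t in image_tokens]


# Toy 4-entry codebook in 2 dimensions and three "patch" feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.9]]

image_tokens = quantize(patch_features, codebook)
sequence = build_sequence([5, 17, 3], image_tokens, image_offset=100)

print(image_tokens)  # [0, 3, 2]
print(sequence)      # [5, 17, 3, 100, 103, 102]
```

In a real tokenizer the feature vectors would come from a learned encoder and the codebook would be trained jointly with it; only the nearest-neighbor lookup is shown here.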

Code & Models

Videos

CogView: Mastering Text-to-Image Generation via Transformers · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Video Analysis and Summarization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dropout · Dense Connections · Adam · VQ-VAE · Label Smoothing