Vector-quantized Image Modeling with Improved VQGAN

Jiahui Yu; Xin Li; Jing Yu Koh; Han Zhang; Ruoming Pang; James Qin; Alexander Ku; Yuanzhong Xu; Jason Baldridge; Yonghui Wu

arXiv:2110.04627·cs.CV·June 7, 2022·92 cites

TL;DR

This paper introduces an improved VQGAN-based image modeling approach that pretrains a Transformer on vector-quantized image tokens, achieving state-of-the-art results in image generation and representation learning.

Contribution

The paper proposes multiple architectural and training improvements to VQGAN, enhancing image reconstruction and modeling capabilities, and demonstrates superior performance on ImageNet.

Findings

01 Achieves a higher Inception Score and a lower FID on ImageNet than vanilla VQGAN.

02 The pretrained Transformer outperforms iGPT in linear-probe accuracy at a similar model size.

03 The pretrained model also surpasses iGPT-XL, a larger model trained with extra web image data.

Abstract

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at \(256\times256\) resolution, we…
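The two stages named in the abstract, turning an image into discrete tokens and then predicting those tokens in raster order, can be illustrated with a toy sketch. This is a generic nearest-neighbour vector quantizer in plain Python, not the paper's ViT-VQGAN or its codebook learning. The codebook values, the 2x2 latent grid, the choice of start-token id, and all function names are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `latent`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(latent, codebook[i]))


def rasterize(latent_grid: Sequence[Sequence[Vector]],
              codebook: Sequence[Vector]) -> List[int]:
    """Quantize a 2-D grid of latents and flatten it row by row."""
    return [quantize(latent, codebook) for row in latent_grid for latent in row]


def next_token_pairs(tokens: Sequence[int], bos: int) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for autoregressive next-token training."""
    sequence = [bos, *tokens]
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


# Hypothetical 4-entry codebook of 2-D codes (real codebooks are far larger).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical encoder output: a 2x2 grid of latent vectors for one image.
latent_grid = [
    [(0.1, -0.1), (0.9, 0.2)],
    [(0.2, 0.8), (1.1, 0.9)],
]

tokens = rasterize(latent_grid, codebook)
# Assumption: the start-of-sequence id is one past the last codebook index.
pairs = next_token_pairs(tokens, bos=len(codebook))

for context, target in pairs:
    print(context, "->", target)
```

Each printed pair is one training example for the Transformer: the tokens seen so far as context and the next image token as the target. In a real system the latents would come from a learned encoder and the targets would be scored with a cross-entropy loss over the codebook indices.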

Code & Models

Videos

Vector-quantized Image Modeling with Improved VQGAN · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Residual Connection · Discriminative Fine-Tuning · Absolute Position Encodings · Adam