Vector-quantized Image Modeling with Improved VQGAN
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu

TL;DR
This paper introduces an improved VQGAN-based image modeling approach that pretrains a Transformer on vector-quantized image tokens, achieving state-of-the-art results in image generation and representation learning.
Contribution
The paper proposes multiple architectural and training improvements to VQGAN, enhancing image reconstruction and modeling capabilities, and demonstrates superior performance on ImageNet.
Findings
Achieves higher Inception Score and lower FID on ImageNet compared to vanilla VQGAN.
The pretrained Transformer (VIM-L) outperforms iGPT-L in linear-probe accuracy at a similar model size.
VIM-L also surpasses iGPT-XL, a larger model trained with additional web image data.
Abstract
Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer learning and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional, class-conditioned image generation and unsupervised representation learning. When trained on ImageNet at \(256\times256\) resolution, we…
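The pipeline the abstract describes (encode an image into discrete tokens, flatten them in raster order, then train a Transformer on next-token prediction) can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the function names, the toy 2x2 grid and the three-entry codebook are hypothetical, and the lookup is a plain nearest-neighbour search rather than the improved codebook learning the paper proposes. The ViT encoder and the Transformer itself are omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest_code(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def quantize_grid(grid: Sequence[Sequence[Vector]],
                  codebook: Sequence[Vector]) -> List[List[int]]:
    """Map every latent vector in a 2D grid to a discrete token id."""
    return [[nearest_code(v, codebook) for v in row] for row in grid]


def rasterize(token_grid: Sequence[Sequence[int]]) -> List[int]:
    """Flatten the token grid row by row (raster order)."""
    return [t for row in token_grid for t in row]


def next_token_pairs(tokens: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Build (prefix, target) pairs for autoregressive next-token training."""
    return [(list(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


# Hypothetical toy data: a 3-entry codebook and a 2x2 grid of encoder outputs.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
grid = [
    [(0.1, 0.0), (0.9, 0.1)],
    [(0.0, 0.8), (0.2, 0.1)],
]

token_grid = quantize_grid(grid, codebook)
tokens = rasterize(token_grid)
pairs = next_token_pairs(tokens)
print(tokens)
print(pairs)
```

In a real system the grid would come from the learned encoder, and each (prefix, target) pair would contribute a cross-entropy term to the Transformer's training loss.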
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Residual Connection · Discriminative Fine-Tuning · Absolute Position Encodings · Adam
