TL;DR
This paper combines CNNs and transformers to efficiently generate high-resolution, semantically guided images, achieving state-of-the-art results among autoregressive models on class-conditional ImageNet synthesis.
Contribution
It presents a novel method that leverages CNNs for local feature extraction and transformers for modeling global image composition, enabling high-resolution image synthesis.
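One way to picture this division of labor: a convolutional encoder produces a grid of local feature vectors, each vector is replaced by the index of its closest entry in a learned vocabulary (a codebook), and the resulting index sequence is what a transformer would then model. The sketch below is a toy illustration under that assumption, not the authors' implementation. The function names, the tiny hand-written codebook, and the use of a plain nearest-neighbor lookup are all hypothetical choices made for the example.

```python
# Illustrative sketch only: a toy nearest-neighbor codebook lookup.
# All names here (squared_distance, quantize, codebook, features) are
# hypothetical and chosen for this example.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: 2D grid (list of rows) of feature vectors, standing in for
              the output of a CNN encoder.
    codebook: list of code vectors, standing in for the learned vocabulary.
    Returns a flat, row-major list of codebook indices.
    """
    indices = []
    for row in features:
        for vec in row:
            nearest = min(
                range(len(codebook)),
                key=lambda k: squared_distance(vec, codebook[k]),
            )
            indices.append(nearest)
    return indices


# Toy data: a 2x2 grid of 2-dimensional features and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [1.1, -0.1]],
]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The flat index list is far shorter than the pixel sequence it stands for, which is why a sequence model over such indices can stay tractable at high resolution.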
Findings
- Achieves state-of-the-art results among autoregressive models on class-conditional ImageNet synthesis.
- Is the first to demonstrate semantically guided megapixel image generation with transformers.
- Effectively combines the inductive bias of CNNs with the expressivity of transformers.
Abstract
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations,…
Code & Models
- 🤗 dalle-mini/dalle-mini (model) · 192 dl · ♡ 396
- 🤗 caio13/dalle-mono (model)
- 🤗 AlexKM/vqgan-clp (model) · ♡ 2
- 🤗 airsat/dalle-mini (model) · 9 dl · ♡ 1
- 🤗 igotech/text2image (model)
- 🤗 microsoft/radedit (model) · ♡ 29
- 🤗 markweber/taming_vqgan (model)
- 🤗 matrix11404/bopts_news (model) · 1 dl
- 🤗 MCA61/VQGAN (model)
- 🤗 raw9/dalle-mini (model)
Videos
VQ-GAN: Taming Transformers for High-Resolution Image Synthesis | Paper Explained (YouTube)
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
