Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers
Yanhong Zeng, Huan Yang, Hongyang Chao, Jianbo Wang, Jianlong Fu

TL;DR
This paper introduces TokenGAN, a transformer-based generator that models image synthesis as visual token generation, enabling fine-grained, content-aware style control and achieving state-of-the-art results on high-resolution benchmarks.
Contribution
It proposes a novel token-based generator framework using transformers for flexible, fine-grained image synthesis and style control, surpassing existing methods in quality and resolution.
Findings
Achieved state-of-the-art results on FFHQ and LSUN benchmarks.
Synthesized high-fidelity images up to 1024x1024 resolution.
Dispensed with convolutions entirely in high-resolution image synthesis.
Abstract
We present a new perspective on image synthesis by viewing the task as a visual token generation problem. Unlike existing paradigms that directly synthesize a full image from a single input (e.g., a latent code), the new formulation enables flexible local manipulation of different image regions, which makes it possible to learn content-aware and fine-grained style control for image synthesis. Specifically, it takes as input a sequence of latent tokens to predict the visual tokens for synthesizing an image. Under this perspective, we propose a token-based generator (i.e., TokenGAN). In particular, TokenGAN takes as input two semantically different kinds of visual tokens, i.e., the learned constant content tokens and the style tokens from the latent space. Given a sequence of style tokens, TokenGAN is able to control the image synthesis by assigning the styles to the content…
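One way to picture the style assignment described in the abstract is cross-attention from content tokens to style tokens. The sketch below is an illustration under assumptions, not the authors' implementation. The abstract is cut off before it states how styles are assigned, so the single-head scaled dot-product attention (suggested only by the Multi-Head Attention method tag), the residual update, the token counts, the feature dimension, the random stand-in weights, and all function and variable names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def assign_styles(content, styles, w_q, w_k, w_v):
    """Assumed mechanism: each content token attends over all style tokens.

    Returns the updated content tokens and the attention weights
    (one row per content token, one column per style token).
    """
    q = matmul(content, w_q)  # queries come from content tokens
    k = matmul(styles, w_k)   # keys come from style tokens
    v = matmul(styles, w_v)   # values come from style tokens
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    mixed = matmul(weights, v)
    # Residual update (an assumption): content token plus its mixed style.
    updated = [
        [c + m for c, m in zip(c_row, m_row)]
        for c_row, m_row in zip(content, mixed)
    ]
    return updated, weights


def random_matrix(rows, cols, rng):
    """Random stand-in for learned parameters or sampled tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8                                        # hypothetical feature size
content_tokens = random_matrix(16, dim, rng)   # stand-in for learned constant content tokens
style_tokens = random_matrix(4, dim, rng)      # stand-in for style tokens from the latent space
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

updated_tokens, attention = assign_styles(content_tokens, style_tokens, w_q, w_k, w_v)
```

Each row of `attention` sums to 1 and indicates how strongly one content token draws on each style token. Under these assumptions, that per-token weighting is what would let different image regions receive different styles.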
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Dropout · Residual Connection · Dense Connections · Absolute Position Encodings · Byte Pair Encoding
