ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis
Zanlin Ni, Yulin Wang, Renping Zhou, Yizeng Han, Jiayi Guo, Zhiyuan Liu, Yuan Yao, Gao Huang

TL;DR
This paper introduces ENAT, a token-based image synthesis model that uses insights into the spatial and temporal interactions of non-autoregressive Transformers (NATs) to improve the efficiency and quality of image generation.
Contribution
ENAT explicitly models critical token interactions in NATs, disentangling spatial and temporal computations to enhance efficiency and image synthesis quality.
Findings
ENAT outperforms prior NATs on the ImageNet and MS-COCO datasets.
ENAT significantly reduces computational costs compared to traditional NATs.
Experimental results validate the effectiveness of the proposed approach.
Abstract
Recently, token-based generation has demonstrated its effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed. At each step, the unrevealed image regions are padded with mask tokens and inferred by the NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from NATs: Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. Specifically, mask tokens mainly gather information for decoding, while visible tokens tend to primarily provide information, and their deep representations can be built only upon…
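The progressive decoding described in the abstract (start fully masked, infer the masked positions, reveal some tokens, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the predictor is a random stand-in for a trained NAT, and the cosine reveal schedule, the confidence-based selection, and all names (`MASK`, `toy_predictor`, `nat_decode`) are assumptions chosen for illustration.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token (assumption)


def toy_predictor(tokens, vocab_size, rng):
    """Random stand-in for a NAT forward pass (assumption, not a real model).

    Returns {position: (predicted_token, confidence)} for every masked position.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def nat_decode(num_tokens=16, vocab_size=8, steps=4, seed=0):
    """Reveal tokens over a fixed number of steps, starting from all-mask."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    revealed_per_step = []
    for step in range(1, steps + 1):
        # Unrevealed positions are padded with MASK and inferred by the model.
        guesses = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / steps)
        )
        num_reveal = max(0, len(guesses) - keep_masked)
        # Assumed selection rule: reveal the most confident predictions first.
        ranked = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)
        for i in ranked[:num_reveal]:
            tokens[i] = guesses[i][0]
        revealed_per_step.append(sum(t != MASK for t in tokens))
    return tokens, revealed_per_step


if __name__ == "__main__":
    final_tokens, progress = nat_decode()
    print(progress)      # cumulative count of revealed tokens after each step
    print(final_tokens)  # no MASK entries remain after the last step
```

In a real NAT the predictor would be a Transformer that processes mask and visible tokens together at every step; the sketch only shows the control flow of incremental revealing.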
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
