ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis

Zanlin Ni; Yulin Wang; Renping Zhou; Yizeng Han; Jiayi Guo; Zhiyuan Liu; Yuan Yao; Gao Huang

arXiv:2411.06959 · cs.CV · November 12, 2024

TL;DR

This paper introduces ENAT, a novel token-based image synthesis model that leverages insights into spatial and temporal interactions in NATs to improve efficiency and performance in image generation tasks.

Contribution

ENAT explicitly models critical token interactions in NATs, disentangling spatial and temporal computations to enhance efficiency and image synthesis quality.

Findings

01 ENAT achieves better performance on ImageNet and MS-COCO datasets.

02 ENAT significantly reduces computational costs compared to traditional NATs.

03 Experimental results validate the effectiveness of the proposed approach.

Abstract

Recently, token-based generation methods have demonstrated their effectiveness in image synthesis. As a representative example, non-autoregressive Transformers (NATs) can generate decent-quality images in a few steps. NATs perform generation in a progressive manner, where the latent tokens of a resulting image are incrementally revealed. At each step, the unrevealed image regions are padded with mask tokens and inferred by the NAT. In this paper, we delve into the mechanisms behind the effectiveness of NATs and uncover two important patterns that naturally emerge from NATs. Spatially (within a step), although mask and visible tokens are processed uniformly by NATs, the interactions between them are highly asymmetric. Specifically, mask tokens mainly gather information for decoding, while visible tokens tend to primarily provide information, and their deep representations can be built only upon…
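
The progressive generation described in the abstract (start fully masked, predict, reveal some tokens, repeat) can be illustrated with a toy sketch. This is not the ENAT implementation or the code in the official repository: `toy_predictor` is a hypothetical stand-in that returns random tokens and confidences instead of running a Transformer, and the `MASK` id, the linear reveal schedule, and the confidence-based selection rule are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical id reserved for the mask token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a NAT forward pass (assumption: random outputs).

    Returns one (token_id, confidence) pair per position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def progressive_decode(num_tokens=16, vocab_size=8, steps=4, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start with every position masked
    revealed_per_step = []

    for step in range(1, steps + 1):
        # Unrevealed positions are padded with MASK and inferred together.
        preds = toy_predictor(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]

        # Linear schedule (assumption): reveal an equal share each step.
        target = math.ceil(num_tokens * step / steps)
        n_reveal = target - (num_tokens - len(masked))

        # Reveal the masked positions the predictor is most confident about.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]

        revealed_per_step.append(num_tokens - tokens.count(MASK))

    return tokens, revealed_per_step


tokens, revealed_per_step = progressive_decode()
```

With the defaults above, `revealed_per_step` is `[4, 8, 12, 16]`: four more latent tokens become visible at each of the four steps, and no mask tokens remain at the end. In a real NAT the predictor would attend over both visible and mask tokens at every step, which is the interaction the paper analyzes.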

Code & Models

Repositories

leaplabthu/enat (PyTorch, official)

Videos

ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis · slideslive

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis