Bag of Design Choices for Inference of High-Resolution Masked Generative Transformer
Shitong Shao, Zikai Zhou, Tian Ye, Lichen Bai, Zhiqiang Xu, and Zeke Xie

TL;DR
This paper investigates and optimizes inference techniques for masked generative Transformers in high-resolution image generation, providing practical design choices that improve performance and sampling efficiency.
Contribution
It introduces and analyzes specific inference strategies for MGT, filling a gap in existing research, and demonstrates their effectiveness through extensive experiments.
Findings
Enhanced inference techniques improve sampling quality.
The proposed design choices reach a winning rate of roughly 70% on HPS v2.
Sampling acceleration methods increase efficiency.
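To make the winning-rate figure above concrete, here is a minimal sketch of how a pairwise winning rate could be computed from per-prompt preference scores. It is not the authors' evaluation code. The function name `win_rate`, the choice to count ties as half a win, and the example scores are all assumptions made for illustration.

```python
def win_rate(candidate_scores, baseline_scores):
    """Fraction of prompts on which the candidate outscores the baseline.

    Both lists hold one score per prompt, in the same prompt order.
    Ties count as half a win (an assumption of this sketch).
    """
    if len(candidate_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    if not candidate_scores:
        raise ValueError("score lists must not be empty")

    wins = 0.0
    for cand, base in zip(candidate_scores, baseline_scores):
        if cand > base:
            wins += 1.0
        elif cand == base:
            wins += 0.5
    return wins / len(candidate_scores)


# Hypothetical scores for four prompts (illustrative values, not real results).
example = win_rate([0.3, 0.2, 0.5, 0.4], [0.1, 0.25, 0.4, 0.4])
```

A winning rate of 0.5 means the two samplers are indistinguishable under the metric, so a value near 0.7 indicates the candidate is preferred on most prompts.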
Abstract
Text-to-image diffusion models (DMs) are developing at an unprecedented pace, supported by thorough theoretical exploration and empirical analysis. Unfortunately, the discrepancy between DMs and autoregressive models (ARMs) complicates the path toward unified vision and language generation. Recently, the masked generative Transformer (MGT) has emerged as a promising intermediary between DMs and ARMs: it predicts randomly masked image tokens (i.e., masked image modeling), combining the efficiency of DMs with the discrete-token nature of ARMs. However, we find that comprehensive analyses of MGT inference are virtually non-existent, and we aim to present positive design choices that fill this gap. We propose and redesign a set of enhanced inference techniques tailored for MGT and provide a detailed analysis of their performance. Additionally, we explore several…
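For readers unfamiliar with masked image modeling at inference time, the following sketch shows the generic confidence-based iterative unmasking loop used by MaskGIT-style decoders, with a cosine masking schedule. It illustrates the baseline setting only, not the enhanced techniques the paper proposes. The `toy_predictor` is a random stand-in for the Transformer, and `MASK`, `cosine_mask_ratio`, and `masked_decode` are hypothetical names. The sketch also omits the sampling noise that practical decoders usually add to the confidences.

```python
import math
import random

MASK = -1  # hypothetical sentinel id marking a still-masked position


def cosine_mask_ratio(step, total_steps):
    """Fraction of tokens that should remain masked after `step` steps."""
    return math.cos(math.pi / 2 * (step / total_steps))


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the Transformer: one (token_id, confidence) per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_decode(num_tokens, vocab_size, total_steps, seed=0):
    """Start fully masked; commit the most confident predictions each step."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, total_steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # How many positions the schedule leaves masked after this step.
        n_remain = max(0, math.floor(cosine_mask_ratio(step, total_steps) * num_tokens))
        n_unmask = max(0, len(masked) - n_remain)
        # Rank masked positions by confidence, highest first, and commit them.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens


decoded = masked_decode(num_tokens=16, vocab_size=1024, total_steps=8)
```

Because several tokens are committed per step, the number of forward passes is `total_steps` rather than `num_tokens`, which is the efficiency advantage over token-by-token autoregressive decoding.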
Taxonomy
Topics: Welding Techniques and Residual Stresses
Methods: Attention Is All You Need · Adam · Residual Connection · Byte Pair Encoding · Linear Layer · Sparse Evolutionary Training · Absolute Position Encodings · Multi-Head Attention · Dense Connections · Label Smoothing
