Resurrect Mask AutoRegressive Modeling for Efficient and Scalable Image Generation

Yi Xin; Le Zhuo; Qi Qin; Siqi Luo; Yuewen Cao; Bin Fu; Yangfan He; Hongsheng Li; Guangtao Zhai; Xiaohong Liu; Peng Gao

arXiv:2507.13032·cs.CV·July 18, 2025

TL;DR

This paper introduces MaskGIL, an improved Masked AutoRegressive (MAR) model for image generation that matches the quality of state-of-the-art AR models with far fewer inference steps (8 instead of 256), and extends to text-to-image and speech-to-image tasks.

Contribution

The paper refines the MAR architecture by replacing causal attention with bidirectional attention and adding 2D RoPE, which yields high-quality image generation in fewer steps and extends the model to text and speech inputs.
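To make the difference between causal and bidirectional attention concrete, here is a minimal NumPy sketch. It is an illustration only, not the authors' code: the function name `attention_weights` and the toy 4x4 score matrix are assumptions made for this example.

```python
import numpy as np

def attention_weights(scores, causal):
    """Row-wise softmax over attention scores (hypothetical helper).

    causal=True  : token i attends only to tokens j <= i (standard AR).
    causal=False : every token attends to every token (bidirectional).
    """
    n = scores.shape[-1]
    if causal:
        # Lower-triangular mask: future positions get -inf and vanish in softmax.
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax along the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy scores for a 4-token sequence (random values, for illustration only).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
causal_w = attention_weights(scores, causal=True)
bidir_w = attention_weights(scores, causal=False)
```

With the causal mask, every weight above the diagonal is exactly zero; without it, each token places some weight on every other token, which is what lets a masked model predict many positions in parallel.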

Findings

01. MaskGIL reaches an FID of 3.71 on the ImageNet 256x256 benchmark, matching state-of-the-art AR models.

02. It requires only 8 inference steps, compared with 256 for standard AR models.

03. It extends to text-driven and speech-to-image generation.
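The step counts in the findings above can be illustrated with a small sketch of parallel masked decoding. The summary does not say which unmasking schedule MaskGIL uses, so the cosine schedule below is an assumption (it is a common choice in masked image generation), and `tokens_per_step` is a hypothetical helper. The figure of 256 tokens is inferred from the 256 steps an AR model needs when it emits one token per step.

```python
import math

def tokens_per_step(num_tokens=256, num_steps=8):
    """Tokens committed at each parallel decoding step.

    Assumes a cosine masking schedule; this is an illustrative choice,
    not confirmed by the paper summary.
    """
    if not 1 <= num_steps <= num_tokens:
        raise ValueError("need 1 <= num_steps <= num_tokens")
    counts = []
    remaining = num_tokens  # tokens still masked
    for step in range(1, num_steps + 1):
        if step == num_steps:
            still_masked = 0  # the final step commits everything left
        else:
            # Fraction of tokens that should stay masked after this step.
            mask_ratio = math.cos(math.pi / 2 * step / num_steps)
            still_masked = math.floor(num_tokens * mask_ratio)
            # Always commit at least one token per step.
            still_masked = min(still_masked, remaining - 1)
        counts.append(remaining - still_masked)
        remaining = still_masked
    return counts

parallel = tokens_per_step(256, 8)      # masked decoding: 8 forward passes
sequential = tokens_per_step(256, 256)  # one token per pass, as in plain AR
```

Under this schedule the early steps commit only a few tokens and the later steps commit many, while the total number of forward passes drops from 256 to 8.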

Abstract

AutoRegressive (AR) models have made notable progress in image generation, with Masked AutoRegressive (MAR) models gaining attention for their efficient parallel decoding. However, MAR models have traditionally underperformed when compared to standard AR models. This study refines the MAR architecture to improve image generation quality. We begin by evaluating various image tokenizers to identify the most effective one. Subsequently, we introduce an improved Bidirectional LLaMA architecture by replacing causal attention with bidirectional attention and incorporating 2D RoPE, which together form our advanced model, MaskGIL. Scaled from 111M to 1.4B parameters, MaskGIL achieves an FID score of 3.71, matching state-of-the-art AR models on the ImageNet 256x256 benchmark, while requiring only 8 inference steps compared to the 256 steps of AR models. Furthermore, we develop a text-driven…
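The abstract mentions 2D RoPE without giving details. One common formulation, sketched below as an assumption rather than as the paper's exact design, splits each feature vector in half and rotates the first half by the token's row index and the second half by its column index. The names `rope_1d` and `rope_2d`, the base of 10000, and the 4x4 grid are all illustrative choices.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (n, d) by angles pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    angles = pos[:, None] * freqs[None, :]      # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rope_2d(x, height, width):
    """Apply rotary embeddings on a height x width token grid.

    x has shape (height * width, d) in row-major order, with d divisible by 4.
    Assumed layout: first half of the features encodes the row index,
    second half encodes the column index.
    """
    n, d = x.shape
    assert n == height * width and d % 4 == 0
    rows, cols = np.divmod(np.arange(n), width)
    half = d // 2
    return np.concatenate(
        [rope_1d(x[:, :half], rows), rope_1d(x[:, half:], cols)], axis=-1
    )

# Toy example: a 4x4 grid of tokens with 8 features each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
out = rope_2d(x, height=4, width=4)
```

Because each pair of features is only rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends only on their row and column offsets, which is the property that makes rotary embeddings suitable for a 2D token grid.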

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Industrial Vision Systems and Defect Detection