HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

Hermann Kumbong; Xian Liu; Tsung-Yi Lin; Ming-Yu Liu; Xihui Liu; Ziwei Liu; Daniel Y. Fu; Christopher Ré; David W. Romero

arXiv:2506.04421 · cs.CV · June 6, 2025

TL;DR

HMAR introduces a hierarchical masked auto-regressive approach for image generation that improves quality, speed, and flexibility over existing autoregressive and diffusion models by using next-scale prediction and masked token generation.

Contribution

The paper proposes HMAR, a novel hierarchical autoregressive model that enhances image generation by reducing sequence length, enabling flexible sampling schedules, and improving efficiency and quality.

Findings

1. HMAR matches or outperforms existing models on ImageNet benchmarks.
2. HMAR achieves over 2.5x faster training and 1.75x faster inference.
3. HMAR reduces inference memory footprint by over 3x.

Abstract

Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. VAR reformulates autoregressive modeling by decomposing an image into successive resolution scales. During inference, an image is generated by predicting all the tokens in the next (higher-resolution) scale, conditioned on all tokens in all previous (lower-resolution) scales. However, this formulation suffers from reduced image quality due to the parallel generation of all tokens in a resolution scale; has sequence lengths scaling superlinearly in image resolution; and requires retraining to change the sampling schedule. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR), a new image generation algorithm that alleviates these issues using next-scale prediction and masked prediction to generate high-quality images with…
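The next-scale conditioning described in the abstract can be made concrete with a small token-counting sketch. This is only an illustration under stated assumptions: the scale schedule `[1, 2, 4, 8, 16]` and the helper names are hypothetical and are not taken from the paper or its released code. It shows how many tokens each scale contributes and how many previous-scale tokens each scale is conditioned on.

```python
from typing import List, Sequence


def tokens_per_scale(side_lengths: Sequence[int]) -> List[int]:
    """Token count of each square token map with side length s (s * s)."""
    return [s * s for s in side_lengths]


def context_lengths(side_lengths: Sequence[int]) -> List[int]:
    """Tokens from all previous (lower-resolution) scales that each scale is conditioned on."""
    contexts: List[int] = []
    running = 0
    for n in tokens_per_scale(side_lengths):
        contexts.append(running)  # everything generated before this scale
        running += n
    return contexts


def total_sequence_length(side_lengths: Sequence[int]) -> int:
    """Total tokens across all scales, i.e. the full multi-scale sequence."""
    return sum(tokens_per_scale(side_lengths))


if __name__ == "__main__":
    # Hypothetical schedule of token-map side lengths (an assumption, not the paper's).
    scales = [1, 2, 4, 8, 16]
    print("tokens per scale:", tokens_per_scale(scales))
    print("context per scale:", context_lengths(scales))
    print("total sequence length:", total_sequence_length(scales))
    print("final scale alone:", scales[-1] ** 2)
```

Under this assumed schedule the full multi-scale sequence is longer than the final token map alone, because every coarser scale stays in the context; the actual schedules and scaling behavior are those given in the paper, not this sketch.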

Code & Models

Models

nvidia/HMAR (model on Hugging Face)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings