Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots

Guangting Zheng; Yehao Li; Yingwei Pan; Jiajun Deng; Ting Yao; Yanyong Zhang; Tao Mei

arXiv:2505.20288·cs.CV·May 27, 2025

TL;DR

This paper introduces Hi-MAR, a hierarchical autoregressive model that uses low-resolution image tokens as pivots to improve global context understanding and generation quality in visual tasks, with reduced computational costs.

Contribution

It proposes a novel hierarchical autoregressive framework with low-resolution pivots and a diffusion transformer head, enhancing global structure modeling in image generation.

Findings

01. Outperforms typical AR baselines in quality

02. Requires fewer computational resources

03. Effective in class-conditional and text-to-image tasks

Abstract

Autoregressive models have emerged as a powerful generative paradigm for visual generation. The current de facto standard of next-token prediction commonly operates over a single-scale sequence of dense image tokens and cannot exploit global context, especially when predicting early tokens. In this paper, we introduce a new autoregressive design that models a hierarchy from a few low-resolution image tokens to the typical dense image tokens, and explores the hierarchical dependency across multi-scale image tokens in depth. Technically, we present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling in a multi-phase manner. In the first phase, Hi-MAR learns to predict a few low-resolution image tokens that function as intermediary pivots reflecting the global structure. Such pivots act as the…
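The multi-phase idea in the abstract can be illustrated with a toy control-flow sketch. This is not the authors' implementation: the grid sizes, step counts, and both predictor functions below are hypothetical stand-ins (the paper uses learned models, including a diffusion transformer head), and the way dense tokens are conditioned on pivots here is an assumption made purely for illustration.

```python
import random


def masked_ar_fill(num_tokens, num_steps, predict_fn, context=None, seed=0):
    """Fill a fully masked sequence over several steps, a random subset per step."""
    rng = random.Random(seed)
    tokens = [None] * num_tokens          # None marks a still-masked position
    order = list(range(num_tokens))
    rng.shuffle(order)                    # random generation order
    per_step = -(-num_tokens // num_steps)  # ceiling division
    for start in range(0, num_tokens, per_step):
        positions = order[start:start + per_step]
        # The predictor sees the tokens known so far plus optional context (pivots).
        predictions = predict_fn(tokens, positions, context)
        for pos, value in zip(positions, predictions):
            tokens[pos] = value
    return tokens


def low_res_predictor(tokens, positions, context):
    """Hypothetical stand-in for the phase-one model: emits a dummy token id."""
    return [pos % 7 for pos in positions]


def make_dense_predictor(low_side, dense_side):
    """Hypothetical stand-in for the phase-two model: copies the covering pivot."""
    scale = dense_side // low_side

    def predict(tokens, positions, pivots):
        out = []
        for pos in positions:
            row, col = divmod(pos, dense_side)
            out.append(pivots[(row // scale) * low_side + (col // scale)])
        return out

    return predict


def generate(low_side=4, dense_side=16, low_steps=2, dense_steps=8):
    # Phase 1: a few low-resolution tokens that act as global pivots.
    pivots = masked_ar_fill(low_side * low_side, low_steps, low_res_predictor)
    # Phase 2: dense tokens, with the pivots passed in as extra context.
    dense = masked_ar_fill(
        dense_side * dense_side,
        dense_steps,
        make_dense_predictor(low_side, dense_side),
        context=pivots,
    )
    return pivots, dense


pivots, dense = generate()
print(len(pivots), len(dense))
```

With the assumed 4×4 and 16×16 grids, the first phase produces 16 tokens and the second 256, so every dense token can consult a global summary that is only one sixteenth the length of the dense sequence.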

Code & Models

Repositories

hidream-ai/himar (PyTorch, official)

Videos

Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots · SlidesLive

Taxonomy

Topics: Advanced Vision and Imaging

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Diffusion · Position-Wise Feed-Forward Layer · Absolute Position Encodings