Hita: Holistic Tokenizer for Autoregressive Image Generation

Anlin Zheng; Haochen Wang; Yucheng Zhao; Weipeng Deng; Tiancai Wang; Xiangyu Zhang; Xiaojuan Qi

arXiv:2507.02358 · cs.CV · July 14, 2025

TL;DR

Hita is a new holistic image tokenizer designed for autoregressive models, improving global understanding, training speed, and image quality in tasks like generation, style transfer, and in-painting.

Contribution

Hita introduces a holistic-to-local tokenization scheme with learnable holistic queries and a sequential structure, enhancing autoregressive image generation.

Findings

01 Achieves 2.59 FID and 281.9 IS on ImageNet

02 Accelerates training speed of AR generators

03 Effective in zero-shot style transfer and in-painting

Abstract

Vanilla autoregressive image generation models generate visual tokens step-by-step, limiting their ability to capture holistic relationships among token sequences. Moreover, because most visual tokenizers map local image patches into latent tokens, global information is limited. To address this, we introduce Hita, a novel image tokenizer for autoregressive (AR) image generation. It introduces a holistic-to-local tokenization scheme with learnable holistic queries and local patch tokens. Hita incorporates two key strategies to better align with the AR generation process: 1) arranging a sequential structure with holistic tokens at the beginning, followed by patch-level tokens, and using causal attention to maintain awareness of previous tokens; and 2) adopting a lightweight fusion module before feeding the de-quantized tokens into the decoder to control information flow and…
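
The first strategy in the abstract, placing holistic tokens ahead of patch-level tokens under causal attention, can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the helper names `build_sequence` and `causal_mask`, the token labels, and the token counts are all hypothetical, and the mask is an ordinary lower-triangular causal mask assumed for the example.

```python
def build_sequence(holistic_tokens, patch_tokens):
    """Order the sequence with holistic tokens first, then patch-level tokens."""
    return list(holistic_tokens) + list(patch_tokens)


def causal_mask(length):
    """Return mask[i][j] == True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


# Hypothetical sizes, chosen only to keep the example small.
num_holistic = 2
num_patch = 3

holistic = [f"h{i}" for i in range(num_holistic)]  # stand-ins for holistic tokens
patches = [f"p{i}" for i in range(num_patch)]      # stand-ins for patch-level tokens

sequence = build_sequence(holistic, patches)
mask = causal_mask(len(sequence))

# Each row lists which earlier (or same) positions a token can attend to.
for token, row in zip(sequence, mask):
    visible = [s for s, allowed in zip(sequence, row) if allowed]
    print(token, "attends to", visible)
```

Under this ordering, every patch-level position can attend to all holistic positions, while the holistic positions never see patch-level tokens.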
