Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

Zhi-Kai Chen; Jun-Peng Jiang; Han-Jia Ye; De-Chuan Zhan

arXiv:2510.25739 · cs.CV · October 30, 2025

TL;DR

Hawk leverages the spatial structure of images to accelerate autoregressive text-to-image generation, reporting a 1.71x speedup over standard AR models while maintaining image fidelity and diversity.

Contribution

Hawk introduces a spatially aware speculative decoding approach that improves the efficiency of autoregressive image generation models.

Findings

01 Achieves 1.71x speedup over standard AR models

02 Maintains high image fidelity and diversity

03 Effectively utilizes spatial structure for faster generation

Abstract

Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward…
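
To illustrate the draft-then-verify idea the abstract describes, here is a minimal sketch of the generic accept/reject rule from standard speculative sampling as used for text models. It is not Hawk's spatially guided method, whose details are not given in this excerpt. The function name `speculative_step` and the toy probability lists are assumptions made for illustration only.

```python
import random


def speculative_step(p, q, rng):
    """One accept/reject step of generic speculative sampling (illustrative).

    p: target-model probabilities over the vocabulary (list of floats)
    q: draft-model probabilities over the same vocabulary
    rng: a random.Random instance
    Returns (token, accepted).
    """
    # The lightweight draft model proposes a token from q.
    x = rng.choices(range(len(q)), weights=q, k=1)[0]

    # The target model accepts it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True

    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    x = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return x, False


if __name__ == "__main__":
    rng = random.Random(0)
    target = [0.5, 0.3, 0.2]  # hypothetical target distribution
    draft = [0.4, 0.4, 0.2]   # hypothetical draft distribution
    results = [speculative_step(target, draft, rng) for _ in range(10_000)]
    rate = sum(accepted for _, accepted in results) / len(results)
    print(f"acceptance rate: {rate:.2f}")
```

When the draft and target distributions agree, every proposal is accepted; the more they diverge, the more often the target must resample. This is why the abstract's point about a larger sampling space complicating draft-target alignment bears directly on the achievable speedup.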
