Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation
Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan

TL;DR
Hawk is a novel method that leverages the spatial structure of images to accelerate autoregressive text-to-image generation, achieving significant speedups while maintaining quality.
Contribution
Hawk introduces a spatially-aware speculative decoding approach that improves the efficiency of autoregressive image generation models.
Findings
Achieves 1.71x speedup over standard AR models
Maintains high image fidelity and diversity
Effectively utilizes spatial structure for faster generation
Abstract
Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a lightweight draft model to approximate the output of a larger AR model, has shown promise in accelerating text generation without compromising quality. However, its application to image generation remains largely underexplored. The challenges stem from a significantly larger sampling space, which complicates the alignment between the draft and target model outputs, coupled with the inadequate use of the two-dimensional spatial structure inherent in images, thereby limiting the modeling of local dependencies. To overcome these challenges, we introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
