Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

TL;DR
This paper introduces Locality-aware Parallel Decoding (LPD), a novel method that significantly accelerates autoregressive image generation by enabling flexible, high-parallelism generation orders while maintaining quality.
Contribution
It proposes a new architecture and scheduling technique that allow arbitrary parallel generation with minimized dependencies, reducing steps and latency without quality loss.
Findings
Reduced generation steps from 256 to 20 for 256x256 images.
Achieved at least 3.4x lower latency compared to previous methods.
Maintained generation quality on ImageNet benchmarks.
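As a rough illustration of what the step reduction implies, the sketch below splits 256 tokens across 20 decoding steps. It assumes the sequential baseline emits one token per step (so 256 steps means 256 tokens) and uses an evenly split schedule; the helper `even_group_sizes` is hypothetical, and the paper's actual per-step group sizes may differ.

```python
def even_group_sizes(num_tokens: int, num_steps: int) -> list[int]:
    """Split num_tokens into num_steps groups whose sizes differ by at most one.

    Illustrative assumption only: this is NOT the schedule from the paper.
    """
    base, extra = divmod(num_tokens, num_steps)
    # The first `extra` steps take one additional token so the sizes sum exactly.
    return [base + 1 if i < extra else base for i in range(num_steps)]


sizes = even_group_sizes(256, 20)
average_per_step = sum(sizes) / len(sizes)  # 256 / 20 tokens per step on average
```

Under these assumptions each step has to predict about 13 tokens at once, which is why dependencies among tokens decoded in the same step matter for quality.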
Abstract
We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms…
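The mutual visibility described in technique (1) can be pictured as an attention mask for a single decoding step. The following is a minimal sketch under stated assumptions, not the authors' implementation: the sequence is laid out as already generated context tokens followed by position query tokens, the context tokens are assumed to be token-level causal and blind to the queries, and the function name `step_attention_mask` is hypothetical.

```python
import numpy as np


def step_attention_mask(num_context: int, num_queries: int) -> np.ndarray:
    """Boolean mask for one parallel step; mask[i, j] is True if i may attend to j.

    Rows and columns are ordered as [context tokens..., position query tokens...].
    The layout and the causal treatment of context tokens are assumptions
    made for illustration only.
    """
    total = num_context + num_queries
    mask = np.zeros((total, total), dtype=bool)
    # Context tokens: causal among themselves, never attending to query tokens.
    mask[:num_context, :num_context] = np.tril(
        np.ones((num_context, num_context), dtype=bool)
    )
    # Query tokens: see every context token ...
    mask[num_context:, :num_context] = True
    # ... and every query decoded in the same step (mutual visibility).
    mask[num_context:, num_context:] = True
    return mask


mask = step_attention_mask(num_context=3, num_queries=2)
```

In this sketch the query rows are fully visible to one another, while the context block never attends to the queries, which mirrors the idea of decoupling context modeling from decoding.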
Peer Reviews
Decision · ICLR 2026 Oral
* Sound Method Design: The learnable position query tokens decouple context modeling from decoding, enabling generation at arbitrary target positions and boosting flexibility. The exploration of two locality principles, particularly the second one, offers meaningful insights for the community.
* Strong Performance: The method achieves clear reductions in generation steps and latency with good quality.
1. Overclaimed Contributions in Writing
- For "Flexible Parallelized Autoregressive Modeling", decoder-only works like PAR/ZipAR/NAR already treat previously decoded tokens as a KV cache and decode the queries in parallel, ensuring mutual visibility among tokens generated concurrently; the key difference lies only in LPD’s position query tokens (enabling arbitrary target positions), which should be clarified to avoid overstating contributions.
- For "Locality-aware Generation Orderi
The paper conducts a genuinely deep analysis of the key factors that affect both performance and generation quality in parallel autoregressive decoding (e.g., group size, dependency structure, attention visibility), and then turns those observations into a coherent, end-to-end parallelization method rather than a single heuristic component. Through careful comparisons with recent parallel AR implementations (e.g., encoder–decoder style SAR/ARPG and decoder-only RANDAR), the authors show that th
Experiments are limited to image generation; since current AR models are increasingly used for multimodal I/O (image–text, video tokens, layout, even audio tokens), it would strengthen the claim of “general AR parallelization” to show at least one non-image setting (e.g., CLIP-conditioned image tokens, image+text joint decoding, or video latents). The paper does not compare against the newest AR acceleration lines such as speculative decoding, speculative Jacobi-style decoding, or draft/verify
1. The proposed Flexible Parallelized Autoregressive Modeling overcomes the constraint of a fixed generation order by allowing images to be synthesized in an arbitrary sequence. This capability holds the potential for discovering more effective generation orders in the future.
2. When equipped with the proposed Locality-aware Generation Ordering strategy, LPD demonstrates improved FID scores and greater generation efficiency on the ImageNet dataset.
3. The paper is easy to read and the figures a
1. The paper's core algorithm (Algorithm 1) is presented in the appendix. While space constraints are understandable, the most critical algorithm should ideally be included in the main text, or at the very least, its underlying principles should be explained there.
2. The computational cost of the model increases compared to traditional fixed-order autoregressive models due to the use of additional positional query tokens. However, an analysis of this overhead is absent from the paper.
3. In lin
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Advanced Neural Network Applications
