Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees
Haodong Lei, Hongsong Wang, Xin Geng, Liang Wang, Pan Zhou

TL;DR
This paper introduces ADT-Tree, a dynamic draft tree method that adapts its structure based on image region complexity, significantly accelerating autoregressive image inference while maintaining quality.
Contribution
The paper proposes ADT-Tree, an adjacency-adaptive dynamic draft tree that adjusts its depth and width during inference based on local prediction difficulty, improving speed for visual autoregressive models.
Findings
Achieves over 3x speedup on MS-COCO and PartiPrompts datasets.
Seamlessly integrates with relaxed sampling methods like LANTERN.
Maintains high image quality despite acceleration.
Abstract
Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative decoding to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive dynamic draft tree that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones.…
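To make the adaptation idea concrete, here is a minimal sketch of how a draft tree shape might be carried over from the horizontally adjacent position and then nudged according to the observed acceptance rate. This is not the authors' implementation: the names `TreeShape`, `init_from_left_neighbor` and `adapt`, the thresholds 0.3 and 0.7, the depth and width bounds, and the "move halfway toward a bound" rule are all assumptions chosen for illustration, since the abstract does not spell out the exact update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TreeShape:
    depth: int  # number of draft levels
    width: int  # candidates kept per level


def init_from_left_neighbor(shapes, row, col, default):
    """Reuse the shape stored for the left neighbour, else fall back to default."""
    return shapes.get((row, col - 1), default)


def adapt(shape, accept_rate, lo=0.3, hi=0.7, d_range=(1, 8), w_range=(2, 16)):
    """Hypothetical halving update: move depth/width halfway toward a bound."""
    d_min, d_max = d_range
    w_min, w_max = w_range
    if accept_rate >= hi:
        # Drafts are mostly accepted (simple region): deeper, narrower tree.
        depth = (shape.depth + d_max + 1) // 2
        width = (shape.width + w_min) // 2
    elif accept_rate <= lo:
        # Drafts are mostly rejected (complex region): shallower, wider tree.
        depth = (shape.depth + d_min) // 2
        width = (shape.width + w_max + 1) // 2
    else:
        # Acceptance rate is in the middle band: keep the current shape.
        return shape
    return TreeShape(depth, width)


default = TreeShape(depth=4, width=8)
shapes = {(0, 0): adapt(default, accept_rate=0.9)}
next_shape = init_from_left_neighbor(shapes, row=0, col=1, default=default)
```

In this sketch a high acceptance rate pushes the shape toward the maximum depth and minimum width, and a low acceptance rate does the opposite, which mirrors the "deeper trees in simple regions and wider trees in complex ones" behaviour described in the abstract.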
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is well-written, and the proposed idea is easy to follow. 2. The problem of accelerating visual autoregressive model inference is timely and relevant.
1. **Justification for Core Observations**: The paper's motivation rests on two observations: (1) token generation difficulty varies (allegedly easier for low-frequency regions and harder for high-frequency regions), and (2) this difficulty exhibits locality. However, both the novelty and the evidence for these claims could be further substantiated. * The observation that token difficulty varies seems to echo existing concepts. The dynamic tree drafting in EAGLE-2, for instance, is already desi
- The problem statement and the solution are intuitive and clear. - PEANUT effectively accelerates generation and can be combined with other acceleration methods.
- The experimental setup does not specify the inference engine or system configuration. This omission makes the reported gains difficult to interpret. Also, reporting the absolute wall-clock latency would be informative. - The evaluation suite omits more advanced benchmarks such as GenEval. Including such metrics would better substantiate quality under acceleration. - In the experiments section, LlamaGen is mentioned, but LlamaGen results are absent from the main tables and discussion. Presentin
The paper effectively demonstrates the advantages of an adaptive speculative decoding tree within the autoregressive (AR) image generation domain, highlighting how dynamic depth and width control can improve token efficiency. The authors’ explicit commitment to open-sourcing the code upon acceptance is commendable, as it will likely foster reproducibility and stimulate further research in AR acceleration.
The core idea of dynamically adjusting speculative decoding parameters is already explored in LLM works such as Cascade Drafting (Chen et al., 2024) and Medusa (Cai et al., 2024). The paper should better clarify how PEANUT goes beyond simply transferring these ideas to the visual domain. The experimental section lacks comparisons with the most recent AR acceleration methods (e.g., parallel speculative decoding, hybrid prefix caching), making it difficult to judge the relative performance gains.
- The paper is well written and easy to understand. - The method is simple and straightforward to implement. - It shows performance improvements over both EAGLE-2 and recent EAGLE-2 based methods (LANTERN).
- **Low performance**: PEANUT generally only shows acceleration at T=0 (greedy decoding). However, as shown in Tab. 1 and 2, T=0 sampling incurs a significant drop in generation quality, and thus is not a standard sampling method in many AR image generation models. At normal T=1 sampling, while quality is maintained, PEANUT only shows a low speed-up of ~1.05x. - **Comparison with SJD**: A further issue is that SJD already achieves high acceptance rate (>2) at T=1, even without complex tree atten
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Cell Image Analysis Techniques
