DiSA: Diffusion Step Annealing in Autoregressive Image Generation
Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng

TL;DR
This paper introduces diffusion step annealing (DiSA), a training-free method that reduces diffusion steps during autoregressive image generation, significantly speeding up inference while preserving quality.
Contribution
DiSA builds on the observation that tokens generated later in the autoregressive process are easier to sample, and it gradually decreases the number of diffusion steps as generation proceeds. The result is a simple yet effective acceleration technique.
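A minimal sketch of the annealing idea, assuming a linear schedule in which the number of diffusion steps per autoregressive step falls from a starting value to a final value. The function name `linear_step_schedule` and the parameters `t_early` and `t_late` are hypothetical names chosen for illustration, and the 50 -> 5 default is an assumed example range, not necessarily the authors' setting.

```python
from typing import List


def linear_step_schedule(num_ar_steps: int, t_early: int = 50, t_late: int = 5) -> List[int]:
    """Return the number of diffusion steps to use at each autoregressive step.

    Assumption: the count decreases linearly from t_early (first AR step)
    to t_late (last AR step). This is an illustrative schedule only.
    """
    if num_ar_steps < 1:
        raise ValueError("num_ar_steps must be at least 1")
    if num_ar_steps == 1:
        return [t_early]

    schedule = []
    for i in range(num_ar_steps):
        # Fraction of the autoregressive process completed so far, in [0, 1].
        frac = i / (num_ar_steps - 1)
        steps = round(t_early + frac * (t_late - t_early))
        schedule.append(max(1, steps))  # never use fewer than one diffusion step
    return schedule


schedule = linear_step_schedule(num_ar_steps=10)
baseline_total = 50 * 10  # constant 50 diffusion steps at every AR step
print(schedule, sum(schedule), baseline_total)
```

With 10 autoregressive steps, this sketch uses 275 diffusion steps in total, compared with 500 for a constant 50 steps per autoregressive step. Actual wall-clock savings depend on how much of the inference time the diffusion head accounts for.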
Findings
Achieves 5-10x faster inference on MAR and Harmon models.
Maintains generation quality despite reduced diffusion steps.
Applicable to multiple autoregressive diffusion models.
Abstract
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. Intuitively, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on…
Peer Reviews
Decision·Submitted to ICLR 2026
DiSA introduces a new interpretation of diffusion dynamics within AR generation. As conditioning strengthens across timesteps, the diffusion process becomes inherently easier. Unlike prior accelerators (e.g., DDIM, DPM-Solver, LazyMAR), which assume uniform difficulty and globally reduce steps, DiSA models the heterogeneity of diffusion necessity over AR progression. This observation is theoretically supported through denoising-path straightness analysis, linking DiSA to the geometry of diffusio
While DiSA is empirically well-justified, it remains largely heuristic. The diffusion-step schedule is fixed (typically linear), without a principled derivation from diffusion dynamics or uncertainty theory. In contrast, prior works like AdaDiff or Rectified Flow introduce adaptive step sizes based on explicit error or confidence estimation. DiSA assumes the AR step index monotonically correlates with conditional strength, an assumption not guaranteed for complex prompts. A theoretical analysis
* Clear empirical insight (straighter late‑stage denoising) turned into a simple, general sampler schedule that’s easy to integrate into various AR + diffusion architectures (Fig. 1).
* Strong evaluation: various image‑level metrics (such as FID/IS/Precision/Recall), per‑image time, and complementarity with existing diffusion accelerators (Table 3).
* Practical wins on both ImageNet 256 x 256 and T2I GenEval (Harmon) with concrete speed–quality curves (Fig. 5).
I'm not an expert in this area, but I have some concerns and questions based on my understanding.
* About novelty: I appreciate the practical acceleration idea, but the paper mainly relies on the diffusion-step annealing strategy without much theoretical/mathematical evidence. Even a bit of math or intuition on why this schedule makes sense would make the work solid.
* Scheduler robustness: How sensitive is performance to the exact annealing schedule (e.g., 50 -> 5)? Could the authors provide
1. The paper presents a convincing empirical study demonstrating that later AR steps have more constrained distributions.
2. Experiment results demonstrate consistent speed-ups across four major AR-diffusion models (MAR, FlowAR, xAR, Harmon) with minimal loss in quality.
3. The paper is well-written, logically structured, concise, and clear, making it easy for readers to understand.
1. The annealing schedule (linear, cosine, two-stage) and the choice of T_early, T_late are not extensively analyzed. The robustness of these settings across datasets and models could be better demonstrated.
2. The idea of step annealing has precedents in pure diffusion models (e.g., DDIM, DPM-Solver). The novelty here lies in transferring and validating this principle within autoregressive-diffusion frameworks.
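To make the schedule shapes named in this review concrete, here is a rough sketch of a cosine and a two-stage schedule between T_early and T_late. The exact definitions used in the paper are not given in this excerpt, so the formulas, the function names, and the `early_fraction` parameter are assumptions for illustration only.

```python
import math
from typing import List


def cosine_step_schedule(num_ar_steps: int, t_early: int = 50, t_late: int = 5) -> List[int]:
    """Assumed cosine shape: decays slowly at first, fastest mid-way, slowly at the end."""
    if num_ar_steps < 1:
        raise ValueError("num_ar_steps must be at least 1")
    if num_ar_steps == 1:
        return [t_early]
    schedule = []
    for i in range(num_ar_steps):
        frac = i / (num_ar_steps - 1)
        # Weight goes from 1 (first AR step) to 0 (last AR step) along a half cosine.
        weight = 0.5 * (1.0 + math.cos(math.pi * frac))
        schedule.append(max(1, round(t_late + (t_early - t_late) * weight)))
    return schedule


def two_stage_step_schedule(
    num_ar_steps: int, t_early: int = 50, t_late: int = 5, early_fraction: float = 0.5
) -> List[int]:
    """Assumed two-stage shape: t_early for the first part, then t_late for the rest."""
    if num_ar_steps < 1:
        raise ValueError("num_ar_steps must be at least 1")
    switch_at = int(num_ar_steps * early_fraction)  # hypothetical switch point
    return [t_early if i < switch_at else t_late for i in range(num_ar_steps)]


for name, schedule in [
    ("cosine", cosine_step_schedule(10)),
    ("two-stage", two_stage_step_schedule(10)),
]:
    # Total diffusion steps is a crude proxy for diffusion-head cost.
    print(name, schedule, "total:", sum(schedule))
```

Comparing the totals against a constant-step baseline gives only a rough cost estimate. The robustness question raised above concerns generation quality, which a sketch like this cannot measure.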
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Medical Image Segmentation Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate · Diffusion
