Continuous Speculative Decoding for Autoregressive Image Generation
Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

TL;DR
This paper introduces continuous speculative decoding, a novel method to accelerate autoregressive image generation models by over two times without sacrificing quality, inspired by techniques from large language models.
Contribution
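It develops the first continuous speculative decoding framework. Denoising trajectory alignment and token pre-filling address the low acceptance rate caused by mismatched target and draft output distributions, and acceptance-rejection sampling handles the modified distribution, which has no analytic expression.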
Findings
Achieves over 2x speedup in image generation
Maintains original quality of generated images
Reduces inference time significantly without model retraining
Abstract
Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding has effectively accelerated discrete autoregressive inference. However, the absence of an analogous theory for continuous distributions precludes its use in accelerating continuous AR models. To fill this gap, this work presents continuous speculative decoding, and addresses challenges from: 1) low acceptance rate, caused by inconsistent output distribution between target and draft models, and 2) modified distribution without analytic expression, caused by complex integral. To address challenge 1), we propose denoising trajectory alignment and token pre-filling strategies. To address challenge 2), we introduce acceptance-rejection sampling algorithm…
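The accept/resample logic described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-dimensional Gaussian densities as stand-ins for the target distribution p and the draft distribution q, and the names `gaussian_pdf` and `speculative_sample` are hypothetical. A draft sample x ~ q is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the residual distribution proportional to max(0, p - q), as in the standard speculative decoding construction. Because that residual has no closed-form sampler here, the sketch uses acceptance-rejection sampling with p as the proposal.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian; a stand-in for a model's output density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def speculative_sample(p_mean, p_std, q_mean, q_std, rng):
    """Return (sample, draft_accepted); the sample follows the target p.

    Assumption: p (target) and q (draft) are 1-D Gaussians chosen only
    for illustration, not the distributions used in the paper.
    """
    # Step 1: the draft model proposes x ~ q.
    x = rng.gauss(q_mean, q_std)
    p_x = gaussian_pdf(x, p_mean, p_std)
    q_x = gaussian_pdf(x, q_mean, q_std)

    # Step 2: accept the draft with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_x / q_x):
        return x, True

    # Step 3: on rejection, sample from the residual max(0, p - q) / Z.
    # Z is never computed: propose y ~ p and keep it with probability
    # max(0, p(y) - q(y)) / p(y), which is always in [0, 1].
    while True:
        y = rng.gauss(p_mean, p_std)
        p_y = gaussian_pdf(y, p_mean, p_std)
        q_y = gaussian_pdf(y, q_mean, q_std)
        if rng.random() < max(0.0, p_y - q_y) / p_y:
            return y, False


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [speculative_sample(1.0, 1.0, 0.0, 1.2, rng) for _ in range(20000)]
    accepted = sum(1 for _, ok in draws if ok) / len(draws)
    print(f"draft acceptance rate: {accepted:.3f}")
```

In this toy setting the acceptance rate equals the overlap between p and q, which is why a poorly aligned draft distribution translates directly into fewer accepted tokens and a smaller speedup.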
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper identifies a clear and timely problem since speculative decoding methods have so far been limited to discrete token spaces while modern autoregressive image models increasingly operate in continuous latent or diffusion spaces.
- It provides a mathematically grounded extension of speculative decoding to continuous probability distributions with explicit acceptance conditions that ensure nearly lossless decoding.
- It introduces practical techniques such as denoising trajectory align
- The theoretical formulation assumes Gaussian diffusion dynamics and well-aligned draft and target trajectories, but the robustness of these assumptions is not empirically tested under learned or non-Gaussian noise schedules.
- The method is evaluated only on mid-scale research models such as MAR, xAR, and Harmon without examining scalability to larger systems or compatibility with other acceleration methods like grouped or relaxed speculative decoding.
- The analysis of failure modes and sen
The paper’s strongest aspect is that it identifies and articulates a genuinely nontrivial gap between discrete speculative decoding and continuous, diffusion-based AR generation, and then proposes a concrete recipe to fill it. Recognizing that the core obstacle is distributional inconsistency is an original reframing, and it is precisely this reframing that motivates the two key ideas, denoising trajectory alignment and token pre-filling, rather than importing the discrete algorithm verbatim.
The acceptance ratio in Eq. (2) is written as if it used the marginal, but what is actually computed is the factorized reverse-diffusion path probability, i.e. a joint over a sampled trajectory. This is only equal to the desired marginal when you integrate out intermediate states, which the method does not do. Because the same approximation is applied to both p and q, the authors hope the error cancels, but that cancellation is only heuristic and depends critically on the two denoising paths bein
- It is the first to apply speculative decoding in continuous settings.
- Discovers practical problems and handles them by proposing techniques to raise acceptance in practice.
- The method is training-free, facilitating practicality in deployment.
- Most speedups occur at large verification batch sizes, whereas bsz=1 shows diminished speedups. For many interactive or small-batch generation workloads, the practical acceleration may be lower than the headline.
- Quality preservation is assessed primarily with FID/IS. There is no evaluation of text-image faithfulness, such as CLIPScore or GenEval.
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
