Speculative Decoding for Autoregressive Video Generation
Yuezhou Hu, Jintao Zhang

TL;DR
This paper introduces SDVG, a speculative decoding method for autoregressive video diffusion that replaces token-level verification with an image-quality router, reaching up to 2.09x speedup while retaining 95.7% of target quality.
Contribution
SDVG adapts speculative decoding to autoregressive video generation using an image-quality router, enabling faster inference without architectural changes.
Findings
SDVG retains 98.1% of target quality at 1.59x speedup.
SDVG achieves up to 2.09x speedup with 95.7% quality retention.
SDVG outperforms draft-only generation by over 17%.
Abstract
Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration strategy for large language models, can be effectively adapted to autoregressive video generation remains an open question, because video blocks are continuous spatiotemporal tensors with no token-level distribution for exact rejection sampling. We introduce SDVG, which brings speculative decoding to block-based autoregressive video diffusion by replacing token verification with an image-quality router. A 1.3B drafter proposes candidate blocks via four denoising steps; each block is VAE-decoded and scored by ImageReward using worst-frame aggregation, which takes the minimum per-frame reward to catch single-frame artifacts that averaging would mask. Blocks scoring…
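The worst-frame aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names and the per-frame reward values are hypothetical, and in the actual method the per-frame rewards would come from ImageReward applied to VAE-decoded frames. The sketch only shows why taking the minimum flags a single bad frame that a mean would largely hide.

```python
from statistics import fmean


def worst_frame_score(frame_rewards):
    """Aggregate per-frame rewards by taking the minimum (worst frame)."""
    if not frame_rewards:
        raise ValueError("a block needs at least one frame reward")
    return min(frame_rewards)


def mean_frame_score(frame_rewards):
    """Baseline aggregation for comparison: the average over frames."""
    if not frame_rewards:
        raise ValueError("a block needs at least one frame reward")
    return fmean(frame_rewards)


# Hypothetical per-frame rewards for two candidate blocks (illustrative only).
clean_block = [0.8, 0.7, 0.9, 0.8]
glitch_block = [0.8, 0.7, -1.5, 0.8]  # one frame carries an artifact

for name, block in [("clean", clean_block), ("glitch", glitch_block)]:
    print(name, "min:", worst_frame_score(block), "mean:", round(mean_frame_score(block), 3))
```

With these made-up numbers, the glitch block's mean stays positive even though one frame scores far below the rest, whereas the minimum exposes that frame directly. How the resulting score is used to accept or reject a block is not covered here, since the abstract is truncated at that point.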
