TL;DR
VVS introduces a speculative decoding (SD) framework for visual autoregressive (AR) models that reduces inference latency by skipping the target model's verification on a subset of decoding steps.
Contribution
It proposes an SD framework that combines verification skipping with token selection, feature caching, and step scheduling to accelerate visual AR models while maintaining generation quality.
Findings
Reduces target model forward passes by 2.8×
Maintains competitive generation quality
Outperforms conventional SD frameworks in speed-quality trade-off
Abstract
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction in the number of forward passes, limiting its acceleration potential. Motivated by the interchangeability of visual tokens, we explore verification skipping in the SD process for the first time to explicitly cut the number of target model forward passes, thereby reducing inference latency. By analyzing the characteristics of the drafting stage, we observe that verification redundancy and stale feature reusability are key factors to maintain generation quality while improving speed for verification-free steps. Inspired by these two…
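To make the "draft one step, then verify one step" bottleneck concrete, here is a minimal toy sketch, not the authors' implementation. It assumes greedy exact-match acceptance in place of the probabilistic rejection sampling used in standard SD, omits the bonus token normally emitted on full acceptance, and uses a hypothetical `skip_step(step)` predicate as a stand-in for a verification schedule. The names `speculative_decode`, `draft_next`, and `target_next` are illustrative only, and the paper's token selection, feature caching, and step scheduling are not modeled. The sketch only shows how skipping verification on some steps cuts target forward passes, at the cost of accepting unverified draft tokens.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_decode(
    draft_next: Callable[[List[Token]], Token],
    target_next: Callable[[List[Token]], Token],
    num_tokens: int,
    draft_len: int,
    skip_step: Callable[[int], bool],  # hypothetical schedule: True = skip verification
) -> Tuple[List[Token], int]:
    """Toy greedy SD loop with optional verification skipping.

    Returns the generated tokens and the number of target forward passes.
    """
    seq: List[Token] = []
    target_passes = 0
    step = 0
    while len(seq) < num_tokens:
        # Draft stage: the cheap model proposes up to draft_len tokens autoregressively.
        n = min(draft_len, num_tokens - len(seq))
        draft: List[Token] = []
        for _ in range(n):
            draft.append(draft_next(seq + draft))

        if skip_step(step):
            # Verification-free step: accept every drafted token, no target pass.
            seq.extend(draft)
        else:
            # Verification step: counted as ONE target forward pass, since a real
            # transformer scores all draft positions in parallel.
            target_passes += 1
            for i, tok in enumerate(draft):
                expected = target_next(seq + draft[:i])
                if tok != expected:
                    # First mismatch: keep the accepted prefix plus the target's token.
                    seq.extend(draft[:i])
                    seq.append(expected)
                    break
            else:
                # No mismatch: the whole draft is accepted.
                seq.extend(draft)
        step += 1
    return seq, target_passes


# Hypothetical toy "models": the target disagrees with the draft only at position 5.
def draft_next(seq: List[Token]) -> Token:
    return len(seq) % 3


def target_next(seq: List[Token]) -> Token:
    return 9 if len(seq) == 5 else len(seq) % 3


if __name__ == "__main__":
    print(speculative_decode(draft_next, target_next, 8, 4, lambda s: False))
    print(speculative_decode(draft_next, target_next, 8, 4, lambda s: s == 1))
```

With verification on every step, the toy run needs three target passes and corrects position 5 to `9`. Skipping verification on step 1 finishes in a single target pass but keeps the draft's unverified token at that position, which is the speed-quality trade-off the abstract attributes to verification-free steps.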
