Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance

Songsheng Wang; Rucheng Yu; Zhihang Yuan; Chao Yu; Feng Gao; Yu Wang; Derek F. Wong

arXiv:2507.22424 · cs.LG · September 23, 2025

TL;DR

This paper introduces Spec-VLA, a speculative decoding framework that accelerates vision-language-action models by relaxing the acceptance criterion used during verification, achieving a 1.42x speedup over the baseline while maintaining task success rate.

Contribution

The paper proposes a novel speculative decoding method with relaxed acceptance for VLA models, enabling faster generation while maintaining performance.

Findings

01. Achieves 1.42x speedup over baseline

02. Increases acceptance length by 44%

03. Maintains success rate with faster decoding

Abstract

Vision-Language-Action (VLA) models have made substantial progress by leveraging the robust capabilities of Visual Language Models (VLMs). However, VLMs' significant parameter size and autoregressive (AR) decoding nature impose considerable computational demands on VLA models. While Speculative Decoding (SD) has shown efficacy in accelerating Large Language Models (LLMs) by incorporating efficient drafting and parallel verification, allowing multiple tokens to be generated in one forward pass, its application to VLA models remains unexplored. This work introduces Spec-VLA, an SD framework designed to accelerate VLA models. Due to the difficulty of the action prediction task and the greedy decoding mechanism of the VLA models, the direct application of the advanced SD framework to the VLA prediction task yields a minor speed improvement. To boost the generation speed, we propose an…
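
The draft-and-verify loop described in the abstract can be illustrated with a minimal sketch of one verification step under greedy decoding. Everything named here is an assumption for illustration and is not taken from the paper: the function names, the `tolerance` parameter, and in particular the idea that action tokens are ordered bin indices so that a draft token "close enough" to the verifier's token can be accepted. The abstract is truncated before it describes the actual relaxation, so this shows only one plausible form of relaxed acceptance.

```python
from typing import List


def accepted_prefix_length(draft: List[int], target: List[int], tolerance: int = 0) -> int:
    """Count the leading draft tokens that pass verification.

    draft[i]  : token proposed by the small draft model at position i.
    target[i] : greedy (argmax) token of the large verifier at position i,
                obtained for all positions in one parallel forward pass.
    tolerance : 0 means strict greedy matching. A value above 0 is a
                hypothetical relaxation that treats token ids as ordered
                action bins and accepts near misses.
    """
    accepted = 0
    for d, t in zip(draft, target):
        if abs(d - t) > tolerance:
            break  # first rejection ends the accepted prefix
        accepted += 1
    return accepted


def verify_step(draft: List[int], target: List[int], tolerance: int = 0) -> List[int]:
    """Return the tokens emitted by one draft-and-verify step.

    `target` is assumed to hold len(draft) + 1 verifier predictions, so that
    the verifier always contributes one token: a correction at the first
    rejected position, or a bonus token when every draft token is accepted.
    """
    n = accepted_prefix_length(draft, target[: len(draft)], tolerance)
    emitted = list(draft[:n])
    if n < len(target):
        emitted.append(target[n])
    return emitted


# Illustrative token ids only; these are not results from the paper.
draft_tokens = [120, 87, 45, 200]
verifier_tokens = [120, 88, 45, 10, 33]

strict = verify_step(draft_tokens, verifier_tokens, tolerance=0)
relaxed = verify_step(draft_tokens, verifier_tokens, tolerance=2)
print(strict)   # [120, 88]
print(relaxed)  # [120, 87, 45, 10]
```

With strict matching the step stops at the second position, where the draft and verifier differ by one bin. With a tolerance of 2 that near miss is accepted, so the same verifier pass yields more tokens. This is the general trade-off that a relaxed acceptance rule makes: longer accepted prefixes in exchange for small deviations from the verifier's exact greedy output.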

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques