FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference

Divya Jyoti Bajpai; Manjesh Kumar Hanawal

arXiv:2510.22641·cs.LG·October 28, 2025


TL;DR

FastVLM is a self-speculative decoding framework for vision-language models: a lightweight draft model generates tokens autoregressively and the full model verifies them, giving a reported 1.55-1.85x inference speedup with minimal accuracy loss.

Contribution

The paper proposes an imitation-learning-based Self-Speculative Decoding (SSD) framework that speeds up VLM inference with minimal loss in accuracy, using a lightweight draft model whose tokens are verified by the full model.

Findings

01. Speeds up inference by 1.55-1.85x

02. Maintains performance with minimal accuracy loss

03. Integrates a self-speculative decoding approach

Abstract

Vision-Language Models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model's refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model's architecture. Also, it maintains the performance integrity of the full model while training the draft model,…
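The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it shows only generic greedy speculative decoding, and it leaves out the imitation network and the draft-model training. The names `speculative_decode`, `greedy_decode`, `draft_next`, and `full_next` are hypothetical, and the two "models" are toy next-token functions over integers rather than real VLMs.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def full_next(tokens: List[Token]) -> Token:
    """Toy stand-in for the full model's greedy next-token choice."""
    return (tokens[-1] * 3 + 1) % 7


def draft_next(tokens: List[Token]) -> Token:
    """Toy stand-in for the draft model: same rule, but wrong after token 4."""
    guess = (tokens[-1] * 3 + 1) % 7
    return (guess + 1) % 7 if tokens[-1] == 4 else guess


def greedy_decode(prompt: List[Token], next_fn: NextTokenFn, n: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_fn(tokens))
    return tokens


def speculative_decode(
    prompt: List[Token],
    draft_next: NextTokenFn,
    full_next: NextTokenFn,
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Return (tokens, number of verification passes by the full model)."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    full_passes = 0
    while len(tokens) < target_len:
        # 1) Draft: propose up to draft_len tokens autoregressively.
        k = min(draft_len, target_len - len(tokens))
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2) Verify: a real model would check all drafted positions in one
        #    parallel forward pass; this loop only simulates that pass.
        full_passes += 1
        accepted: List[Token] = []
        for drafted in proposal:
            expected = full_next(tokens + accepted)
            if drafted == expected:
                accepted.append(drafted)   # accepted token proceeds as-is
            else:
                accepted.append(expected)  # rejected token is corrected
                break                      # later drafts are discarded
        tokens.extend(accepted)
    return tokens, full_passes
```

Because every kept token is either confirmed or replaced by the full model, the output of this greedy variant matches what the full model would produce on its own. Any speedup comes from needing fewer full-model passes than generated tokens when the draft is usually right.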
