FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya Jyoti Bajpai, Manjesh Kumar Hanawal

TL;DR
FastVLM is a self-speculative decoding framework that accelerates vision-language model inference. A lightweight draft model generates tokens and the full model verifies them, which reduces latency while maintaining accuracy.
Contribution
The paper proposes an imitation-learning-based self-speculative decoding (SSD) framework that speeds up inference for vision-language models (VLMs) without sacrificing performance. A lightweight draft model generates tokens, and the full model verifies them.
Findings
Speeds up inference by 1.55-1.85x
Maintains performance with minimal accuracy loss
Integrates a self-speculative decoding approach
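To put the reported speedup range in terms of latency, a speedup factor s corresponds to a latency reduction of 1 - 1/s. The short sketch below only does this arithmetic on the two endpoints quoted above; the function name `latency_reduction` is an illustrative choice, not something from the paper.

```python
def latency_reduction(speedup: float) -> float:
    """Return the fraction of latency removed for a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Endpoints of the speedup range quoted in the findings.
for s in (1.55, 1.85):
    print(f"{s:.2f}x speedup -> {latency_reduction(s):.1%} lower latency")
```

Under this conversion, the quoted range corresponds to roughly 35% to 46% less time per generated response.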
Abstract
Vision-language models (VLMs) have made significant strides in visual understanding and query response generation, but often face challenges of high computational cost and inference latency due to autoregressive decoding. In this work, we introduce an imitation-learning-based Self-Speculative Decoding (SSD) framework, named FastVLM, to address these limitations. Our approach employs a lightweight draft model for token generation in an autoregressive manner, while a full model verifies these tokens non-autoregressively. Accepted tokens proceed seamlessly, while rejected tokens are corrected by the full model and used to guide the draft model's refinement. Through an imitation network, FastVLM enhances the draft model by integrating deeper-level insights from the full model's architecture. Also, it maintains the performance integrity of the full model while training the draft model,…
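The draft-then-verify loop described in the abstract can be sketched as follows. This is a generic greedy speculative decoding loop under stated assumptions, not FastVLM's implementation: the abstract does not give the acceptance rule, so exact greedy matching is assumed here, and the imitation network and draft-model training are omitted. The names `speculative_decode`, `greedy_decode`, `toy_full_next`, `toy_draft_next`, and the parameter `k` are all hypothetical, and the two toy "models" are plain functions over integer token IDs.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[int]], int]


def speculative_decode(draft_next: NextToken, full_next: NextToken,
                       prompt: Sequence[int], max_new_tokens: int,
                       k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding (assumed acceptance rule: exact match)."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))

        # 2) The full model checks every proposed position. A real system would
        #    score all positions in one parallel forward pass; this loop only
        #    simulates that result.
        accepted: List[int] = []
        for tok in proposal:
            expected = full_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)        # accepted draft token
            else:
                accepted.append(expected)   # rejected: use the full model's token
                break
        else:
            # Every draft token matched, so the full model adds one more token.
            accepted.append(full_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]


def greedy_decode(full_next: NextToken, prompt: Sequence[int],
                  max_new_tokens: int) -> List[int]:
    """Baseline: ordinary token-by-token decoding with the full model only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(full_next(tokens))
    return tokens


def toy_full_next(prefix: Sequence[int]) -> int:
    """Toy stand-in for the full model (arbitrary deterministic rule)."""
    return (31 * sum(prefix) + len(prefix)) % 50


def toy_draft_next(prefix: Sequence[int]) -> int:
    """Toy stand-in for the draft model: disagrees on every third prefix length."""
    guess = toy_full_next(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


if __name__ == "__main__":
    out = speculative_decode(toy_draft_next, toy_full_next, [1, 2, 3], 12)
    print(out == greedy_decode(toy_full_next, [1, 2, 3], 12))
```

With exact-match acceptance, the output is identical to what the full model would produce on its own, which is one way a draft-and-verify scheme can preserve the full model's behavior. The gain comes from how many draft tokens are accepted per verification pass.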
