CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding
Yuanyuan Jia, Shunpu Tang, Qianqian Yang

TL;DR
CoVSpec is a framework for device-edge collaborative inference with vision-language models. It prunes redundant visual tokens, adaptively adjusts decoding, and decouples verification to improve efficiency and reduce communication costs.
Contribution
It introduces a training-free visual token reduction method, an adaptive drafting strategy, and a parallel branching mechanism for efficient device-edge VLM inference.
Findings
Achieves up to 2.21x higher throughput than target-only inference.
Reduces communication overhead by over 96% compared to baselines.
Maintains task accuracy while significantly improving efficiency.
Abstract
Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A practical alternative is device-edge co-inference, where a lightweight draft VLM on the mobile device collaborates with a larger target VLM on the edge server via speculative decoding. Nevertheless, directly extending speculative decoding to VLMs suffers from severe inefficiency due to excessive visual-token computation and high communication overhead. To address these challenges, we propose CoVSpec, an efficient collaborative speculative decoding framework for VLM inference. Specifically, we first develop a training-free visual token reduction framework that prunes redundant visual tokens on the mobile device by jointly considering query relevance,…
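To make the draft-and-verify loop behind speculative decoding concrete, here is a minimal sketch of the generic greedy variant, assuming toy integer-token models. It is not CoVSpec's method: it omits visual token reduction, adaptive drafting, and any communication modelling, and the names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical. The draft model (standing in for the on-device VLM) proposes `k` tokens, and the target model (standing in for the edge-server VLM) keeps the longest prefix it agrees with, then contributes one token of its own.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # Draft phase (device side): propose k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Verify phase (edge side): a real system would score all k positions in
    # one batched forward pass; this sketch checks them one at a time.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(prompt: List[int], draft: Model, target: Model, k: int, max_new: int) -> List[int]:
    """Generate max_new tokens by repeating speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + max_new]


def target_only(prompt: List[int], target: Model, max_new: int) -> List[int]:
    """Baseline: plain greedy decoding with the target model alone."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(ctx: List[int]) -> int:
    # Hypothetical target: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Hypothetical draft: agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=8))
    print(target_only([0], toy_target, max_new=8))
```

With greedy verification the output is identical to target-only decoding; the benefit comes from the target checking several draft tokens per round instead of producing one token per call.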
