MMSpec: Benchmarking Speculative Decoding for Vision-Language Models

Hui Shen; Xin Wang; Ping Zhang; Yunta Hsieh; Qi Han; Zhongwei Wan; Ziheng Zhang; Jingxuan Zhang; Jing Xiong; Ziyuan Liu; Yifan Zhang; Hangrui Cao; Chenyang Zhao; Mi Zhang

arXiv:2603.14989 · cs.CV · March 17, 2026

TL;DR

This paper introduces MMSpec, a benchmark for evaluating speculative decoding in vision-language models (VLMs). It reports three findings on how existing methods behave in multimodal settings and proposes ViSkip, an adaptive decoding method intended to improve inference efficiency.

Contribution

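The paper presents MMSpec, the first benchmark for speculative decoding in VLMs, covering 600 multimodal samples across six task categories and ten speculative decoding algorithms under a unified evaluation framework. It also proposes ViSkip, an adaptive decoding method aimed at improving inference efficiency.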

Findings

01. Text-only optimized methods degrade in multimodal scenarios
02. Vision awareness is crucial at larger batch sizes
03. Throughput speedup does not always correlate with latency improvements
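
The third finding separates two metrics that are easy to conflate. The sketch below assumes throughput means generated tokens per second of wall-clock time and latency means mean per-request completion time; the paper's exact metric definitions are not given on this page. The `Request` type, the function names, and all numbers are hypothetical and invented for illustration; they are not MMSpec results.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    start: float  # seconds, when the request was submitted
    end: float    # seconds, when its last token was produced
    tokens: int   # number of generated tokens


def throughput(reqs: List[Request]) -> float:
    """Generated tokens per second of wall-clock time over the whole run."""
    wall = max(r.end for r in reqs) - min(r.start for r in reqs)
    return sum(r.tokens for r in reqs) / wall


def mean_latency(reqs: List[Request]) -> float:
    """Average time from submission to completion, per request."""
    return sum(r.end - r.start for r in reqs) / len(reqs)


# Toy baseline (made-up numbers): four requests submitted at t=0 and
# served one after another, each taking 2 s to produce 100 tokens.
baseline = [Request(start=0.0, end=2.0 * (i + 1), tokens=100) for i in range(4)]

# Toy candidate (made-up numbers): the same four requests processed as
# one batch, all finishing together at t=6 s.
candidate = [Request(start=0.0, end=6.0, tokens=100) for _ in range(4)]

throughput_speedup = throughput(candidate) / throughput(baseline)
latency_speedup = mean_latency(baseline) / mean_latency(candidate)

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"latency speedup:    {latency_speedup:.2f}x")
```

In this toy setup the candidate finishes the whole workload sooner, so its throughput speedup is above 1, while the average request waits longer, so its latency speedup is below 1. Reporting only the first number would hide the regression in the second.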

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose…
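
For readers unfamiliar with the technique the abstract refers to, the following is a minimal sketch of greedy draft-and-verify speculative decoding. It assumes toy deterministic next-token functions over integer tokens. The names `speculative_decode`, `greedy_decode`, `draft_next`, and `target_next` are hypothetical; this is not MMSpec's framework, ViSkip, or any of the ten benchmarked algorithms. A real system would verify all draft tokens in a single batched forward pass of the target model rather than one call per token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(prefix: List[int], target_next: NextToken, max_new: int) -> List[int]:
    """Reference: plain greedy decoding with the target model only."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def speculative_decode(
    prefix: List[int],
    draft_next: NextToken,
    target_next: NextToken,
    k: int,
    max_new: int,
) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (tokens, draft acceptance rate)."""
    out = list(prefix)
    accepted = 0
    proposed = 0
    while len(out) - len(prefix) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        proposed += k

        # 2. The target model checks the proposals in order and keeps
        #    the longest prefix that matches its own greedy choice.
        n_ok = 0
        for i in range(k):
            if target_next(out + draft[:i]) == draft[i]:
                n_ok += 1
            else:
                break
        accepted += n_ok
        out.extend(draft[:n_ok])

        # 3. The target model always contributes one token: a correction
        #    after a mismatch, or a bonus token if all drafts were accepted.
        out.append(target_next(out))

    return out[: len(prefix) + max_new], accepted / proposed


# Toy models (assumptions for illustration): the target counts upward
# modulo 10; the draft agrees except right after the token 4.
def target_next(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


tokens, rate = speculative_decode([0], draft_next, target_next, k=3, max_new=12)
print(tokens, rate)
```

Because every emitted token is either confirmed or produced by the target model, the output matches plain greedy decoding with the target alone; only the number of target steps changes. The acceptance rate is what a weaker or modality-unaware draft model would lower.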

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling