Speculative Decoding Reimagined for Multimodal Large Language Models

Luxi Lin; Zhihang Lin; Zhanpeng Zeng; Rongrong Ji

arXiv:2505.14260·cs.CV·May 21, 2025

TL;DR

This paper proposes Multimodal Speculative Decoding (MSD), a novel approach to accelerate inference in Multimodal Large Language Models by decoupling text and visual processing and employing a two-stage training strategy.

Contribution

The paper introduces MSD, which adapts speculative decoding to MLLMs by decoupling text and visual tokens in the draft model and training that model in two stages, reaching speedups of up to 2.46x without accuracy loss.

Findings

01. MSD achieves up to 2.46x speedup on multimodal benchmarks.
02. Decoupling text and visual tokens improves decoding efficiency.
03. Two-stage training enhances both language and visual perception capabilities.

Abstract

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate inference in Multimodal Large Language Models (MLLMs). Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD…
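For readers new to the technique the abstract builds on, the following is a minimal sketch of generic greedy speculative decoding (draft, then verify), not of MSD itself. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and both "models" are toy functions over integer tokens. The sketch assumes greedy decoding and does not model the paper's text/visual decoupling or its two-stage training.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding; matches greedy_decode(target, ...)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) Draft: the cheap model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2) Verify: keep drafted tokens for as long as the target agrees.
        #    A real system scores all drafted positions in one batched forward
        #    pass; this sketch simply calls the target once per position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # 3) First mismatch: take the target's token and discard the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out


# Toy models (assumptions for illustration only): the target counts upward
# modulo 10; the draft agrees except that it guesses wrong after token 4.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


spec_out = speculative_decode(toy_target, toy_draft, [0], max_new=12, k=4)
ref_out = greedy_decode(toy_target, [0], max_new=12)
```

In this sketch the output is identical to decoding with the target alone, which is the sense in which speculative decoding does not sacrifice accuracy. Any speedup comes from how many drafted tokens the target accepts per verification step, so a draft model that agrees with the target more often yields longer accepted runs.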

Code & Models

Repositories

lyn-lucy/msd (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings