Speculative Decoding Reimagined for Multimodal Large Language Models
Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji

TL;DR
This paper proposes Multimodal Speculative Decoding (MSD), a novel approach to accelerate inference in Multimodal Large Language Models by decoupling text and visual processing and employing a two-stage training strategy.
Contribution
The paper introduces MSD, which reimagines speculative decoding for MLLMs by decoupling text and visual tokens in the draft model and training it in two stages, achieving significant speedups without accuracy loss.
Findings
MSD achieves up to 2.46x speedup on multimodal benchmarks.
Decoupling text and visual tokens improves decoding efficiency.
Two-stage training enhances both language and visual perception capabilities.
Abstract
This paper introduces Multimodal Speculative Decoding (MSD) to accelerate inference in Multimodal Large Language Models (MLLMs). Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy. However, current speculative decoding methods for MLLMs fail to achieve the same speedup as they do for LLMs. To address this, we reimagine speculative decoding specifically for MLLMs. Our analysis of MLLM characteristics reveals two key design principles for MSD: (1) Text and visual tokens have fundamentally different characteristics and need to be processed separately during drafting. (2) Both language modeling ability and visual perception capability are crucial for the draft model. For the first principle, MSD decouples text and visual tokens in the draft model, allowing each to be handled based on its own characteristics. For the second principle, MSD…
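For readers unfamiliar with the underlying technique, the following is a minimal sketch of generic greedy speculative decoding (draft, then verify), not of MSD itself: it does not model the text/visual decoupling or the two-stage training described above. The function names, the toy models, and the parameter `k` are all hypothetical and chosen only for illustration. A real system would verify all drafted tokens in a single batched forward pass of the target model; here the target is called token by token for clarity.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: plain autoregressive greedy decoding with the target model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes up to k tokens and the
    target keeps the longest prefix it agrees with, plus one token of its own."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1) Draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2) Target model checks each proposed token against its own greedy choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every drafted token was accepted: take one extra target token if there is room.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens


# Toy stand-ins (assumptions for illustration only, not real language models).
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 5, where it guesses wrong.
    return 0 if ctx[-1] == 5 else toy_target(ctx)


if __name__ == "__main__":
    out = speculative_decode(toy_target, toy_draft, prompt=[1], max_new_tokens=12, k=4)
    print(out == greedy_decode(toy_target, [1], 12))
```

Because every emitted token is either confirmed or supplied by the target model, the output matches plain greedy decoding with the target; the speedup in practice comes from how many drafted tokens are accepted per verification step, which is the quantity the draft model's quality determines.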
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
