SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Zihua Wang; Ruibo Li; Haozhe Du; Joey Tianyi Zhou; Yu Zhang; Xu Yang

arXiv:2505.12728 · cs.CV · February 4, 2026

TL;DR

SpecFLASH is a speculative decoding framework for large multimodal models that exploits the structure of visual inputs and semi-autoregressive prediction to accelerate inference, by up to 2.68x on video captioning, without degrading output quality.

Contribution

SpecFLASH introduces latent-guided token compression and semi-autoregressive decoding tailored to multimodal models, improving inference speed over prior speculative decoding baselines.

Findings

01 Achieves up to 2.68x speed-up on video captioning

02 Achieves up to 2.55x speed-up on visual instruction tuning

03 Surpasses prior speculative decoding baselines in efficiency

Abstract

Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe when handling visual inputs, whose sequences typically contain many more tokens with lower information density than text. Speculative decoding accelerates LLM inference by letting a compact draft model propose candidate tokens that are selectively accepted by a larger target model, achieving speed-up without degrading quality. However, existing multimodal speculative decoding approaches largely ignore the structural characteristics of visual representations and usually rely on text-only draft models. In this paper, we introduce SpecFLASH, a speculative decoding framework tailored to LMMs that explicitly exploits multimodal structure when designing the draft model. We first mitigate redundancy in visual token sequences…
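
The draft-then-verify idea described in the abstract can be illustrated with a minimal sketch. This is not SpecFLASH's implementation: it assumes greedy decoding (acceptance means the draft token equals the target's greedy choice), and the names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical stand-ins for real models. It shows only why the output is identical to decoding with the target model alone.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: plain autoregressive greedy decoding, one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding (a sketch, assuming greedy acceptance)."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))

        # 2) The target checks every proposed position. A real system scores
        #    all positions in one batched forward pass; here we loop for clarity.
        accepted = 0
        correction = None
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if tok != expected:
                correction = expected  # first mismatch: use the target's token
                break
            accepted += 1

        # 3) Keep the accepted prefix, plus the correction if there was one.
        out.extend(proposal[:accepted])
        if correction is not None:
            out.append(correction)
    return out


def toy_target(prefix: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10
```

Calling `speculative_decode(toy_target, toy_draft, [0], max_new=8)` yields the same sequence as `greedy_decode(toy_target, [0], 8)`. Each loop iteration adds at least one token, so the benefit in a real system comes from the target verifying several drafted tokens per forward pass.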

Code & Models

Repositories

zihuaevan/flashsd
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Face Recognition and Analysis

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings