See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

Yicheng Ji; Jun Zhang; Jinpeng Chen; Cong Wang; Lidan Shou; Gang Chen; Huan Li

arXiv:2604.05650 · cs.CL · April 10, 2026

TL;DR

LVSpec is a training-free, visual-semantic guided speculative decoding framework that significantly accelerates video large language models while maintaining high fidelity.

Contribution

The paper introduces LVSpec, a loosely speculative decoding method tailored for Video-LLMs that improves decoding speed without retraining and without strict exact-match verification rules.

Findings

01. LVSpec accelerates Video-LLMs by approximately 2.7x to 2.9x.

02. LVSpec maintains over 99.8% of the target model's performance.

03. LVSpec outperforms state-of-the-art training-free SD methods in speed and acceptance ratio.

Abstract

Video Large Language Models (Video-LLMs) excel in video understanding but suffer from high inference latency during autoregressive generation. Speculative Decoding (SD) mitigates this by applying a draft-and-verify paradigm, yet existing methods are constrained by rigid exact-match rules, severely limiting the acceleration potential. To bridge this gap, we propose LVSpec, the first training-free loosely SD framework tailored for Video-LLMs. Grounded in the insight that generation is governed by sparse visual-relevant anchors (mandating strictness) amidst abundant visual-irrelevant fillers (permitting loose verification), LVSpec employs a lightweight visual-relevant token identification scheme to accurately pinpoint the former. To further maximize acceptance, we augment this with a position-shift tolerant mechanism that effectively salvages positionally mismatched but semantically…
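
To make the contrast between exact-match and loose verification concrete, here is a minimal sketch, not the authors' implementation. It assumes the target model exposes a ranked top-k candidate list per position and that a boolean mask `is_anchor` marking visual-relevant positions is already available. The function names, the top-k acceptance rule for fillers, and the mask are all hypothetical illustrations. The paper's actual token identification scheme and its position-shift tolerant mechanism are not reproduced here.

```python
from typing import Sequence


def verify_exact(draft: Sequence[int], target_top1: Sequence[int]) -> int:
    """Greedy exact-match verification: count the longest prefix of draft
    tokens that equal the target model's top-1 token at each position."""
    accepted = 0
    for d, t in zip(draft, target_top1):
        if d != t:
            break
        accepted += 1
    return accepted


def verify_loose(
    draft: Sequence[int],
    target_topk: Sequence[Sequence[int]],
    is_anchor: Sequence[bool],
) -> int:
    """Loose verification (illustrative assumption): anchor positions must
    match the target's top-1 token; filler positions are accepted if the
    draft token appears anywhere in the target's top-k candidates."""
    accepted = 0
    for d, candidates, anchor in zip(draft, target_topk, is_anchor):
        allowed = candidates[:1] if anchor else candidates
        if d not in allowed:
            break
        accepted += 1
    return accepted


# Toy example with made-up token ids; candidates are ranked, best first.
draft = [5, 9, 2, 7]
target_topk = [[5, 1], [3, 9], [2, 4], [8, 7]]
is_anchor = [True, False, False, True]

n_exact = verify_exact(draft, [c[0] for c in target_topk])
n_loose = verify_loose(draft, target_topk, is_anchor)
print(n_exact, n_loose)
```

In this toy case the exact-match rule stops at the second token, while the loose rule accepts the two filler tokens and stops only at the final anchor position, where strictness is still enforced. This is the sense in which relaxing verification on fillers can raise the acceptance ratio.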
