SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li

TL;DR
SpecVLM is a training-free speculative decoding framework for video LLMs built on verifier-guided, staged token pruning. Because the draft model's speculation is largely insensitive to video token pruning, it can prune up to 90% of video tokens without losing accuracy, which significantly accelerates decoding.
Contribution
It presents a training-free speculative decoding method with staged token pruning for Vid-LLMs, achieving lossless acceleration and robustness.
Findings
Up to 2.68× decoding speedup on benchmark models.
Prunes up to 90% of video tokens without accuracy loss.
Effective across multiple video understanding benchmarks.
Abstract
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens to enable efficient speculation without sacrificing accuracy. To achieve this, we perform a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target…
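The Stage I idea described above, keeping the video tokens that receive the most attention from the verifier, can be sketched as a simple top-k selection. This is only an illustration under stated assumptions: the function name `select_top_tokens`, the `keep_ratio` parameter, and the use of a single pre-aggregated attention score per video token are hypothetical and not taken from the paper, and the excerpt does not say how the verifier's attention is aggregated. Stage II is cut off in the abstract above and is not sketched here.

```python
def select_top_tokens(attn_scores, keep_ratio=0.1):
    """Return the indices of the highest-scoring video tokens, in original order.

    attn_scores: one float per video token (assumed to be an already
                 aggregated attention weight from the verifier model).
    keep_ratio:  fraction of tokens to keep; 0.1 corresponds to pruning 90%.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Keep at least one token so the draft model still sees some video context.
    n_keep = max(1, round(len(attn_scores) * keep_ratio))

    # Rank token positions by score, highest first.
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i],
                    reverse=True)

    # Restore temporal/spatial order of the surviving tokens.
    return sorted(ranked[:n_keep])


# Hypothetical scores for 10 video tokens; keep the top 30%.
scores = [0.05, 0.30, 0.01, 0.20, 0.02, 0.25, 0.03, 0.04, 0.06, 0.04]
kept = select_top_tokens(scores, keep_ratio=0.3)
print(kept)  # [1, 3, 5]
```

In a speculative decoding setup, only the draft model would consume the pruned token set; the verifier still sees the full input, which is what keeps the final output lossless.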
Taxonomy
Topics: Video Coding and Compression Technologies · Generative Adversarial Networks and Image Synthesis · Advanced Steganography and Watermarking Techniques
