Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
Libo Zhang, Zhaoning Zhang, Wangyang Hong, Peng Qiao, Dongsheng Li

TL;DR
This paper introduces Sparrow, a novel framework that enhances speculative decoding in Video Large Language Models by leveraging visual-semantic internalization, resulting in significant speedups and improved performance on long video sequences.
Contribution
Sparrow employs text-anchored window attention and semantic-rich intermediate states to mitigate attention dilution and visual noise, enabling efficient long video inference in Vid-LLMs.
Findings
Achieves 2.82x speedup on long video sequences
Resolves performance degradation in long sequences
Effectively handles 25k visual tokens in real-time tasks
Abstract
Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
