Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs

Libo Zhang; Zhaoning Zhang; Wangyang Hong; Peng Qiao; Dongsheng Li

arXiv:2602.15318 · cs.CV · February 18, 2026

Open Access

TL;DR

This paper introduces Sparrow, a framework that accelerates speculative decoding in Video Large Language Models (Vid-LLMs) by exploiting visual-semantic internalization. It reports a 2.82x speedup on long video sequences and avoids the performance degradation that speculative decoding otherwise shows on long sequences.

Contribution

Sparrow employs text-anchored window attention and semantic-rich intermediate states to mitigate attention dilution and visual noise, enabling efficient long video inference in Vid-LLMs.
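As a rough illustration of what a text-anchored window attention pattern could look like, the sketch below builds a causal mask in which each query sees all earlier text tokens plus a short local window, while older visual tokens are masked out. The masking rule, the function name `text_anchored_window_mask`, and the `window` parameter are assumptions made for illustration. They are not taken from the paper, whose method also relies on reusing target-model hidden states, which this sketch does not model.

```python
from typing import List


def text_anchored_window_mask(is_text: List[bool], window: int) -> List[List[bool]]:
    """Return a causal mask where mask[q][k] is True if query q may attend to key k.

    Assumed rule (illustrative, not the paper's exact definition):
      - every query may attend to earlier text tokens (the "anchors");
      - every query may attend to the last `window` positions, itself included;
      - visual tokens outside that window are masked out.
    """
    n = len(is_text)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only keys at or before the query
            in_window = (q - k) < window
            mask[q][k] = is_text[k] or in_window
    return mask


# Toy sequence: 2 text tokens, 4 visual tokens, 2 text tokens.
is_text = [True, True, False, False, False, False, True, True]
mask = text_anchored_window_mask(is_text, window=2)

# Keys visible to the final text token: the text anchors and the local window.
attended = [k for k, ok in enumerate(mask[7]) if ok]
print(attended)  # [0, 1, 6, 7]
```

In this toy pattern the number of keys each query attends to grows with the number of text tokens and the window size rather than with the number of visual tokens, which is the property that would keep a draft model's attention from being spread across thousands of visual positions.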

Findings

01. Achieves a 2.82x speedup on long video sequences

02. Resolves performance degradation in long sequences

03. Effectively handles 25k visual tokens in real-time tasks

Abstract

Although speculative decoding is widely used to accelerate Vision-Language Model (VLM) inference, it suffers severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically suffers from attention dilution and negative visual gain caused by key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs: critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model…
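For readers unfamiliar with the baseline the abstract builds on, here is a minimal sketch of one generic greedy speculative decoding step: a cheap draft model proposes `k` tokens, and the target model keeps the longest prefix it agrees with, then contributes one token of its own. This is the standard draft-and-verify idea, not Sparrow's specific procedure. The toy `draft` and `target` functions and the name `speculative_step` are hypothetical stand-ins for real models.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: Model, target: Model, k: int) -> List[int]:
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposal. A real system scores all k
    #    positions in one batched forward pass; a loop is used here for clarity.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every proposal matched, so the target adds one extra token.
    accepted.append(target(ctx))
    return accepted


def target(ctx: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([1], draft, target, k=4))  # [2, 3, 4]
print(speculative_step([5], draft, target, k=2))  # [6, 7, 8]
```

The speedup from this scheme depends on how many drafted tokens the target accepts per step, which is why a draft model whose predictions degrade on long video contexts undermines the whole approach.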

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning