Sparse-to-Dense: A Free Lunch for Lossless Acceleration of Video Understanding in LLMs

Xuan Zhang; Cunxiao Du; Sicheng Yu; Jiawei Wu; Fengzhuo Zhang; Wei Gao; Qian Liu

arXiv:2505.19155 · cs.CV · May 19, 2026

TL;DR

Sparse-to-Dense (StD) is a tuning-free decoding strategy for Video-LLMs. A fast module using sparse top-K attention drafts several tokens, and a slow module using dense full attention verifies them in parallel, giving up to a 1.94× speedup without performance loss.

Contribution

The paper introduces StD, a novel, tuning-free decoding method that integrates sparse and dense attention to speed up Video-LLMs during inference.

Findings

01. Achieves up to 1.94× speedup in video processing.

02. Maintains model performance while accelerating inference.

03. Seamlessly transitions from standard to sparse Video-LLMs with minimal code changes.

Abstract

Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a…
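The draft-then-verify loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the authors' implementation: `ToyLM` is a hypothetical one-head model built from random tables, `top_k` and `gamma` (the number of drafted tokens per round) are illustrative parameters, and the top-K keys are chosen by ranking the current attention scores, since this summary does not say how StD selects them. The dense check runs token by token here, whereas the paper verifies drafted tokens in parallel, so the sketch shows why greedy output is unchanged rather than where the speedup comes from.

```python
import math
import random


def make_table(vocab_size, dim, seed):
    """Random lookup table standing in for learned projections (toy only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyLM:
    """Hypothetical one-head, one-layer model; not the paper's architecture."""

    def __init__(self, vocab_size=16, dim=8, seed=0):
        self.q = make_table(vocab_size, dim, seed)
        self.k = make_table(vocab_size, dim, seed + 1)
        self.v = make_table(vocab_size, dim, seed + 2)
        self.out = make_table(vocab_size, dim, seed + 3)
        self.scale = 1.0 / math.sqrt(dim)

    def next_token(self, tokens, top_k=None):
        """Greedy next token; top_k=None means dense attention over all tokens."""
        query = self.q[tokens[-1]]
        scores = [dot(query, self.k[t]) * self.scale for t in tokens]
        keep = list(range(len(tokens)))
        if top_k is not None and top_k < len(keep):
            # Sparse path: attend only to the top_k highest-scoring positions.
            keep = sorted(keep, key=scores.__getitem__, reverse=True)[:top_k]
        peak = max(scores[i] for i in keep)
        weights = [math.exp(scores[i] - peak) for i in keep]
        total = sum(weights)
        context = [
            sum(w * self.v[tokens[i]][d] for w, i in zip(weights, keep)) / total
            for d in range(len(query))
        ]
        logits = [dot(context, row) for row in self.out]
        return max(range(len(logits)), key=logits.__getitem__)


def dense_decode(model, prompt, steps):
    """Baseline: plain greedy decoding with dense attention."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(model.next_token(tokens))
    return tokens[len(prompt):]


def sparse_to_dense_decode(model, prompt, steps, top_k=4, gamma=3):
    """Draft gamma tokens with sparse attention, then verify them densely."""
    tokens = list(prompt)
    target = len(prompt) + steps
    accepted = 0
    while len(tokens) < target:
        draft = []
        for _ in range(gamma):
            draft.append(model.next_token(tokens + draft, top_k=top_k))
        for guess in draft:
            # Sequential stand-in for the parallel dense verification pass.
            truth = model.next_token(tokens)
            tokens.append(truth)  # always keep the dense model's token
            if truth == guess:
                accepted += 1
            if truth != guess or len(tokens) == target:
                break  # discard the rest of the draft after a mismatch
    return tokens[len(prompt):], accepted


model = ToyLM()
prompt = [1, 5, 2, 7, 3, 3, 9, 0, 12, 4]
generated, accepted = sparse_to_dense_decode(model, prompt, steps=12)
print("matches dense decoding:", generated == dense_decode(model, prompt, 12))
print("draft tokens accepted:", accepted, "of", len(generated))
```

Because every emitted token is the dense model's own greedy choice for its prefix, the output matches dense decoding regardless of `top_k` or `gamma`; those settings only change how many drafted tokens are accepted per round. The sketch also omits the extra token that speculative decoding schemes commonly take from the verification pass when a whole draft is accepted.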

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Spatial-Channel Token Distillation