TL;DR
Sparse-to-Dense (StD) is a tuning-free decoding strategy for Video-LLMs that pairs a sparse top-K attention module with a dense full-attention module, speeding up inference by up to 1.94× without loss in model performance.
Contribution
The paper introduces StD, a novel, tuning-free decoding method that integrates sparse and dense attention to speed up Video-LLMs during inference.
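To make the sparse/dense distinction concrete, here is a minimal pure-Python sketch of attention for a single decoding query, once over all cached tokens and once over only the K highest-scoring ones. The function names, the toy vectors, and the choice to rank tokens by the current query's own scores are assumptions for illustration; this summary does not say how the paper selects the retained tokens.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_scores(q, keys):
    # Scaled dot-product score of one query against every cached key.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def dense_attention(q, keys, values):
    # Full attention: every cached token contributes to the output.
    weights = softmax(scaled_scores(q, keys))
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def topk_attention(q, keys, values, k):
    # Sparse attention: keep only the k highest-scoring tokens.
    # Assumption: tokens are ranked with the current query. A real system
    # would pick them once and reuse the choice, since scoring every key
    # at every step saves nothing.
    scores = scaled_scores(q, keys)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    top.sort()  # restore positional order
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]

if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[10.0, 0.0], [0.0, 1.0], [-10.0, 0.0], [9.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.5, 0.5]]
    print("dense :", dense_attention(q, keys, values))
    print("top-2 :", topk_attention(q, keys, values, k=2))
```

When the attention weights are concentrated on a few tokens, as in this toy input, the top-K output stays close to the dense one. When they are spread out, the two diverge, which is why a dense module is still needed as a check.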
Findings
Achieves up to a 1.94× speedup in video processing.
Maintains model performance while accelerating inference.
Enables a seamless transition from a standard Video-LLM to a sparse Video-LLM with minimal code changes.
Abstract
Due to the auto-regressive nature of current video large language models (Video-LLMs), the inference latency increases as the input sequence length grows, posing challenges for the efficient processing of video sequences that are usually very long. We observe that during decoding, the attention scores of most tokens in Video-LLMs tend to be sparse and concentrated, with only certain tokens requiring comprehensive full attention. Based on this insight, we introduce Sparse-to-Dense (StD), a novel decoding strategy that integrates two distinct modules: one leveraging sparse top-K attention and the other employing dense full attention. These modules collaborate to accelerate Video-LLMs without loss. The fast (sparse) model speculatively decodes multiple tokens, while the slow (dense) model verifies them in parallel. StD is a tuning-free, plug-and-play solution that achieves up to a…
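The draft-then-verify loop described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: both models are stand-in callables that map a token list to the next token, decoding is greedy, the draft length `gamma` is a hypothetical parameter, and verification runs position by position here, whereas a real dense model would score all drafted positions in one parallel forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # hypothetical model interface

def greedy_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    # Reference: plain one-token-at-a-time decoding with a single model.
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(draft_next: NextToken, target_next: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       gamma: int = 4) -> List[int]:
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1) The fast (sparse) model drafts gamma tokens.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # 2) The slow (dense) model checks each drafted position.
        accepted: List[int] = []
        for i in range(gamma):
            expected = target_next(tokens + draft[:i])
            if expected != draft[i]:
                accepted.append(expected)  # replace the first mismatch, stop
                break
            accepted.append(draft[i])
        else:
            # Every draft token matched: the dense pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[:target_len]

if __name__ == "__main__":
    # Toy stand-ins: the draft model disagrees with the target now and then.
    def target(ctx: List[int]) -> int:
        return (sum(ctx) * 31 + len(ctx)) % 50

    def draft(ctx: List[int]) -> int:
        return target(ctx) if len(ctx) % 5 else (target(ctx) + 1) % 50

    prompt = [3, 7, 11]
    out = speculative_decode(draft, target, prompt, 20)
    print(out == greedy_decode(target, prompt, 20))
```

Because every emitted token is either confirmed or replaced by the dense model, the output of this greedy variant matches what the dense model would produce alone. Any speedup comes from how many drafted tokens are accepted per dense pass.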
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need · Spatial-Channel Token Distillation
