OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
Feng Chen, Yefei He, Shaoxuan He, Yuanyu He, Jing Liu, Lequan Lin, Akide Liu, Zhaoyang Li, Jiyuan Zhang, Zhenbang Sun, Bohan Zhuang, Qi Wu

TL;DR
OmniSparse introduces a training-aware, fine-grained sparse attention method for long-video multimodal language models, enabling dynamic token selection and significant speed and memory improvements while maintaining full attention performance.
Contribution
It proposes a novel framework with adaptive mechanisms for token and head selection that operate during both training and inference, bridging the training-inference gap in sparse attention methods.
Findings
Achieves up to 2.7x speedup during prefill
Reduces memory usage by 2.4x during decoding
Maintains full attention performance
Abstract
Existing sparse attention methods primarily target inference-time acceleration by selecting critical tokens under predefined sparsity patterns. However, they often fail to bridge the training-inference gap and lack the capacity for fine-grained token selection across multiple dimensions such as queries, key-values (KV), and heads, leading to suboptimal performance and limited acceleration gains. In this paper, we introduce OmniSparse, a training-aware fine-grained sparse attention framework for long-video MLLMs, which operates in both training and inference with dynamic token budget allocation. Specifically, OmniSparse contains three adaptive and complementary mechanisms: (1) query selection via lazy-active classification, retaining active queries that capture broad semantic similarity while discarding most lazy ones that focus on limited local context and exhibit high functional…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsImage and Video Quality Assessment · Advanced Neural Network Applications · Visual Attention and Saliency Detection
