A Unified Sparse Attention via Multi-Granularity Compression
Siran Liu, Zane Cao, Yongchao He

TL;DR
UniSparse introduces a unified sparse attention mechanism using multi-granularity compression and composite tokens, significantly improving efficiency and accuracy for long-context understanding in large language models across various modalities.
Contribution
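The paper proposes UniSparse, a sparse attention method that dynamically constructs sparse attention from composite tokens and multi-granularity compression. It targets two limitations of existing approaches: training-based methods are costly and cannot be applied directly as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality.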
Findings
Achieves ≥99% of full-attention accuracy.
Up to 2.61× faster attention computation than FlashAttention.
Effective across multiple modalities and tasks.
Abstract
Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens: compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and…
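The abstract is truncated here, so the actual UniSparse algorithm is not described on this page. As a rough illustration of the general idea only, the sketch below shows a generic block-selection sparse attention in NumPy: each block of tokens is compressed into a single summary vector, the summaries are scored against each other, and exact attention is then computed only over the highest-scoring key blocks. Everything in it is an assumption made for illustration: mean pooling as a stand-in for composite tokens, a single granularity, non-causal attention, and the hypothetical names `block_sparse_attention`, `block`, and `top_k`. It is not the authors' method.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def full_attention(q, k, v):
    # Dense baseline: the score matrix is (n, n), hence quadratic in n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def block_sparse_attention(q, k, v, block=16, top_k=4):
    # ASSUMPTION: hypothetical helper, not the UniSparse algorithm.
    n, d = q.shape
    assert n % block == 0, "sequence length must be a multiple of block"
    nb = n // block
    top_k = min(top_k, nb)

    # ASSUMPTION: mean pooling stands in for "composite tokens".
    q_c = q.reshape(nb, block, d).mean(axis=1)
    k_c = k.reshape(nb, block, d).mean(axis=1)

    # Coarse (nb, nb) scores are much cheaper than the (n, n) matrix.
    coarse = q_c @ k_c.T
    keep = np.argsort(-coarse, axis=1)[:, :top_k]

    out = np.empty((n, v.shape[1]), dtype=v.dtype)
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        # Gather the token indices of the selected key blocks.
        cols = np.concatenate(
            [np.arange(j * block, (j + 1) * block) for j in keep[i]]
        )
        scores = q[rows] @ k[cols].T / np.sqrt(d)
        out[rows] = softmax(scores) @ v[cols]
    return out
```

With `top_k` equal to the number of blocks, every key block is selected and the result matches dense attention; smaller values trade accuracy for less computation. How closely any such scheme tracks full attention on real models depends on the selection rule, which this sketch does not attempt to reproduce.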
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
