A Unified Sparse Attention via Multi-Granularity Compression

Siran Liu; Zane Cao; Yongchao He

arXiv:2512.14082 · cs.CL · December 17, 2025

TL;DR

UniSparse introduces a unified sparse attention mechanism built on multi-granularity compression and composite tokens. It speeds up attention for long-context understanding in large language models while retaining accuracy close to full attention, across several modalities.

Contribution

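The paper proposes UniSparse, a sparse attention method that dynamically constructs attention from composite tokens and multi-granularity compression. It targets the limitations of existing approaches: training-based methods are costly and cannot be reused as plug-in accelerators for other models, while inference-time methods tend to sacrifice efficiency or cross-modal generality.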
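The page does not say how UniSparse compresses context, so the following is only a generic illustration of what "multi-granularity compression" can mean, assuming mean pooling over fixed-size blocks as the compression operator. The names `mean_pool` and `multi_granularity_compress` are hypothetical and do not come from the paper.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(vectors: List[Vector], block: int) -> List[Vector]:
    """Average consecutive groups of `block` vectors; the last group may be shorter."""
    pooled: List[Vector] = []
    for start in range(0, len(vectors), block):
        group = vectors[start:start + block]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def multi_granularity_compress(vectors: List[Vector], blocks: List[int]) -> Dict[int, List[Vector]]:
    """Hypothetical: one summary sequence per granularity (block size).

    Small blocks keep fine detail; large blocks give coarse, global context.
    """
    return {b: mean_pool(vectors, b) for b in blocks}


if __name__ == "__main__":
    tokens = [[float(i), 1.0] for i in range(8)]
    summaries = multi_granularity_compress(tokens, [2, 4, 8])
    for b, s in summaries.items():
        print(b, s)
```

For 8 tokens and granularities 2, 4 and 8, the sketch yields 4, 2 and 1 summary vectors respectively, so coarser levels trade detail for a shorter sequence to attend over.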

Findings

1. Achieves ≥99% of full-attention accuracy.
2. Up to 2.61× faster attention computation than FlashAttention.
3. Effective across multiple modalities and tasks.

Abstract

Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with sequence length, creating a fundamental computational bottleneck. Existing sparse attention methods alleviate this issue but face trade-offs: training-based methods are costly and cannot be directly applied as acceleration plugins for other models, while inference-time methods often compromise efficiency or cross-modal generality. To address these limitations, we present UniSparse, a unified mechanism that introduces the notion of composite tokens--compact representations that aggregate multi-granularity contextual information. Building on this abstraction, UniSparse dynamically constructs sparse attention through multi-granularity compression and…
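To make the efficiency argument concrete, here is a minimal sketch of one common inference-time sparse attention pattern. It is not necessarily what UniSparse does. The sketch compresses each block of keys into a mean-pooled summary, scores those summaries against the query, and runs exact softmax attention only over the `top_k` highest-scoring blocks. Everything here is an assumption for illustration: the single query vector, mean pooling as the compressor, the `block` and `top_k` parameters, and the function names.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax_weighted_sum(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Exact scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def dense_attention(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Full attention: scores every key, so cost grows linearly per query (quadratically overall)."""
    return softmax_weighted_sum(q, keys, values)


def block_sparse_attention(
    q: Vector, keys: List[Vector], values: List[Vector], block: int, top_k: int
) -> Tuple[Vector, List[int]]:
    """Hypothetical block-sparse attention; returns (output, indices of kept blocks)."""
    n = len(keys)
    starts = list(range(0, n, block))
    # Compress each key block into one summary vector (assumed: mean pooling).
    summaries = []
    for s in starts:
        group = keys[s:s + block]
        summaries.append([sum(k[d] for k in group) / len(group) for d in range(len(q))])
    # Rank blocks by query-summary similarity and keep the top_k, in sequence order.
    ranked = sorted(range(len(starts)), key=lambda i: dot(q, summaries[i]), reverse=True)
    keep = sorted(ranked[:top_k])
    # Exact attention restricted to the tokens inside the kept blocks.
    idx = [j for i in keep for j in range(starts[i], min(starts[i] + block, n))]
    out = softmax_weighted_sum(q, [keys[j] for j in idx], [values[j] for j in idx])
    return out, keep
```

With `top_k` equal to the number of blocks, the sketch reduces exactly to dense attention. With a smaller `top_k`, it scores only about `top_k * block` keys plus one summary per block, instead of all `n` keys, which is the kind of saving sparse attention methods aim for.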

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications