S2-Attention: Hardware-Aware Context Sharding Among Attention Heads
Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song

TL;DR
S2-Attention introduces a hardware-aware, sharded sparse attention method that significantly accelerates large language model inference while maintaining quality, through a novel kernel optimization and heterogeneous context sharding.
Contribution
The paper presents S2-Attention, a kernel-optimized, hardware-aware sparse attention technique with heterogeneous sharding, enabling practical speedups and strong performance in large language models.
Findings
Achieves up to 25.3X speedup over FlashAttention-2
Maintains strong downstream performance at 128k context length
Enables a 4.5X inference speedup for 7B models
Abstract
Sparse attention, which selectively attends to a subset of tokens in the context, was supposed to be efficient. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speed-up over its dense attention counterparts, due to the lack of hardware-aware optimizations like FlashAttention. Meanwhile, it remains unclear whether sparse attention can maintain the model's quality at the scale of today's large language models (LLMs), and how. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels. S2-Attention enables the exploration of novel and high-performance sparse attention techniques, which we demonstrate through extensive ablations across a wide range of sparse attention designs at various model scales. From these insights, we present several…
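To make "customizable at the per-head level" concrete, the following is a minimal pure-Python sketch of one possible heterogeneous sharding scheme: each head keeps its local key block plus a head-specific strided subset of earlier blocks, so that the heads together still cover the full causal context. The function names, the `k % num_heads == head` sharding rule, and the `local_blocks` parameter are all assumptions made for illustration; this is a slow reference computation, not the S2-Attention Triton kernel and not necessarily the pattern the authors use.

```python
import math

def head_block_mask(num_blocks, head, num_heads, local_blocks=1):
    """Causal block mask for one head (hypothetical rule, for illustration).

    A query block attends to its most recent `local_blocks` key blocks plus
    every earlier key block whose index falls in this head's strided shard.
    """
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        for k in range(q + 1):                 # causal: current and earlier blocks only
            is_local = q - k < local_blocks
            in_shard = k % num_heads == head   # assumed sharding rule
            mask[q][k] = is_local or in_shard
    return mask

def sparse_attention(q, k, v, block_mask, block_size):
    """Reference causal attention that skips key blocks masked out for this head."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Keys visible to token i: causal and inside an allowed block.
        keys = [j for j in range(i + 1)
                if block_mask[i // block_size][j // block_size]]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in keys]
        m = max(scores)                        # subtract max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j][c] for wi, j in zip(w, keys)) / z
                    for c in range(len(v[0]))])
    return out

def union_covers_context(masks):
    """True if every causal (query block, key block) pair is seen by some head."""
    nb = len(masks[0])
    return all(any(m[q][k] for m in masks)
               for q in range(nb) for k in range(q + 1))

def kept_fraction(mask):
    """Fraction of causal blocks a single head actually computes."""
    nb = len(mask)
    kept = sum(mask[q][k] for q in range(nb) for k in range(q + 1))
    return kept / (nb * (nb + 1) // 2)

# Example: 4 heads over 8 context blocks.
masks = [head_block_mask(8, h, 4) for h in range(4)]
print([round(kept_fraction(m), 2) for m in masks])
print(union_covers_context(masks))
```

In this sketch each head computes only a fraction of the causal blocks, while the union across heads still reaches every earlier block. Whether that reduction becomes wall-clock speedup depends on the kernel's memory access pattern, which is the gap the abstract says the library targets.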
Peer Reviews
Decision: Submitted to ICLR 2025
1. Useful library. The paper implements a practical sparse attention GPU kernel library that supports both training and inference. The flexibility to support fine-grained sparse patterns can benefit future research towards more effective and efficient sparse pattern design. 2. High efficiency. With the optimized sparse attention kernel, the paper shows speedups of up to 25.3 and 4.5 times for training and inference, respectively, over the dense FlashAttention baseline.
1. The main concern of the paper lies in the proposed sparse attention pattern design. The proposed KV-Cache design principle seems overly conclusive and conflicts with existing works. a. The principle itself is not novel; similar sparse pattern designs for KV-Cache optimization have been explored extensively in prior studies, such as [1, 2]. Furthermore, recent work on retrieval-based KV-Cache reduction [3] demonstrates high performance despite contradicting this principle. It would be ben
1. The paper presents a novel approach to improving the real-world efficiency of sparse attention mechanisms in LLMs through S2-Attention, a customizable, hardware-optimized library. Unlike prior sparse attention methods that often fail to deliver actual speedups, S2-Attention effectively addresses the GPU memory access bottleneck. Additionally, the hybrid architecture combining sparse and dense layers is an innovative solution to balance efficiency and model performance. 2. The paper demonstrat
The paper is innovative in its approach and thorough experimentation. However, there are several critical questions that I raised in the "Question" section, which I believe are essential for the clarity and robustness of the findings. I hope the authors can provide insights on these points, and I look forward to further discussion.
+ This work presents a flexible kernel implementation that supports finer-grained sparse attention. Previous work, FlashAttention-2, requires the sparsity granularity to be the same as the block size, while this work introduces the Merge-Q technique to effectively decouple the granularity of the sparsity pattern from that of the attention computation while achieving the expected speedup. + This work provides a detailed accuracy comparison to demonstrate the effectiveness of heterogeneous context sharding and union complete
- S2-Attention requires training models from scratch, raising concerns about its compatibility with pre-trained models. This limits its flexibility compared to other sparse attention methods (e.g., QUEST, H2O) that support plug-and-play integration. - The benefits of supporting finer-grained sparsity remain unclear; if existing block-sparse attention methods suffice, the proposed library may be less practical.
Taxonomy
Topics: Face and Expression Recognition · Hand Gesture Recognition Systems
Methods: Softmax · Attention Is All You Need · Lib
