S2-Attention: Hardware-Aware Context Sharding Among Attention Heads

Xihui Lin; Yunan Zhang; Suyu Ge; Liliang Ren; Barun Patra; Vishrav Chaudhary; Hao Peng; Xia Song

arXiv:2407.17678·cs.CL·February 6, 2025

TL;DR

S2-Attention introduces a hardware-aware, sharded sparse attention method that significantly accelerates large language model inference while maintaining quality, through a novel kernel optimization and heterogeneous context sharding.

Contribution

The paper presents S2-Attention, a kernel-optimized, hardware-aware sparse attention technique with heterogeneous sharding, enabling practical speedups and strong performance in large language models.
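The idea of heterogeneous sharding can be illustrated with a small mask-construction sketch. This is not the library's API or the paper's exact pattern: the function names, the strided shard rule (`k % num_heads == h`), and the `local_window` parameter are all hypothetical, chosen only to show how each head can attend to a different slice of the context while the heads together still cover every causal position.

```python
def sharded_causal_mask(seq_len, num_heads, local_window):
    """Return masks[h][q][k] = True where head h lets query q attend to key k.

    Assumed, illustrative pattern (not taken from the paper): every head keeps
    a causal local window, plus the strided shard of keys with
    k % num_heads == h.
    """
    masks = []
    for h in range(num_heads):
        head = []
        for q in range(seq_len):
            row = []
            for k in range(seq_len):
                causal = k <= q                  # never look at future tokens
                local = q - k < local_window     # recent tokens, shared by all heads
                in_shard = k % num_heads == h    # this head's slice of the context
                row.append(causal and (local or in_shard))
            head.append(row)
        masks.append(head)
    return masks


def union_covers_causal_context(masks):
    """Check that every causal (q, k) pair is visible to at least one head."""
    seq_len = len(masks[0])
    return all(
        any(head[q][k] for head in masks)
        for q in range(seq_len)
        for k in range(q + 1)
    )


def density(head_mask):
    """Fraction of causal (q, k) pairs that a single head attends to."""
    seq_len = len(head_mask)
    kept = sum(head_mask[q][k] for q in range(seq_len) for k in range(q + 1))
    return kept / (seq_len * (seq_len + 1) // 2)


masks = sharded_causal_mask(seq_len=16, num_heads=4, local_window=2)
# Every key index falls in exactly one head's shard, so the union is complete.
print(union_covers_causal_context(masks))
# Each head on its own is sparse (density below 1.0).
print([round(density(m), 2) for m in masks])
```

In this toy pattern each head computes only a fraction of the causal attention entries, yet no position is dropped by all heads at once. Turning such a mask into wall-clock speedup is the job of a dedicated kernel, which this sketch does not attempt.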

Findings

01. Achieves up to 25.3X speedup over FlashAttention-2

02. Maintains strong downstream performance at 128k context length

03. Enables 4.5X inference speed-up for 7B models

Abstract

Sparse attention, which selectively attends to a subset of tokens in the context, was supposed to be efficient. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speed-up over its dense attention counterparts, due to the lack of hardware-aware optimizations like FlashAttention. Meanwhile, it remains unclear whether, and how, sparse attention can maintain model quality at the scale of today's large language models (LLMs). This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels. S2-Attention enables the exploration of novel and high-performance sparse attention techniques, which we demonstrate through extensive ablations across a wide range of sparse attention designs at various model scales. From these insights, we present several…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. Useful library. The paper implements a practical sparse attention GPU kernel library that supports both training and inference. The flexibility to support fine-grained sparse patterns can benefit future research towards more effective and efficient sparse pattern design.
2. High efficiency. With the optimized sparse attention kernel, the paper shows speedups of up to 25.3 and 4.5 times for training and inference, respectively, over the dense FlashAttention baseline.

Weaknesses

1. The main concern of the paper lies in the proposed sparse attention pattern design. The proposed KV-Cache design principle seems overly conclusive and conflicts with existing works.
   a. The principle itself is not novel; similar sparse pattern designs for KV-Cache optimization have been explored extensively in prior studies, such as [1, 2]. Furthermore, recent work on retrieval-based KV-Cache reduction [3] demonstrates high performance despite contradicting this principle. It would be ben

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper presents a novel approach to improving the real-world efficiency of sparse attention mechanisms in LLMs through S2-Attention, a customizable, hardware-optimized library. Unlike prior sparse attention methods that often fail to deliver actual speedups, S2-Attention effectively addresses the GPU memory access bottleneck. Additionally, the hybrid architecture combining sparse and dense layers is an innovative solution to balance efficiency and model performance.
2. The paper demonstrat

Weaknesses

The paper is innovative in its approach and thorough in its experimentation. However, there are several critical questions that I raised in the "Question" section, which I believe are essential for the clarity and robustness of the findings. I hope the authors can provide insights on these points, and I look forward to further discussion.

Reviewer 03 · Rating 3 · Confidence 3

Strengths

+ this work presents a flexible kernel implementation that supports finer-grained sparse attention. Previous work FlashAttention-2 requires the sparsity granularity to be the same as the block size, while this work introduces the Merge-Q technique to effectively decouple the granularity of sparsity pattern and attention computation while achieving the expected speedup.
+ this work provides a detailed accuracy comparison to demonstrate the effectiveness of heterogeneous context sharding and union complete

Weaknesses

- S2-Attention requires training models from scratch, raising concerns about its compatibility with pre-trained models. This limits its flexibility compared to other sparse attention methods (e.g., QUEST, H2O) that support plug-and-play integration.
- the benefits of supporting finer-grained sparsity remain unclear; if existing block sparse attention methods suffice, the proposed library may be less practical.


Taxonomy

Topics: Face and Expression Recognition · Hand Gesture Recognition Systems

Methods: Softmax · Attention Is All You Need · Lib