TL;DR
This paper introduces Flash Sparse Attention (FSA), a kernel implementation that significantly improves the efficiency of native sparse attention in large language models, especially with smaller query head groups, enabling faster training and inference.
Contribution
FSA provides a kernel implementation for Native Sparse Attention (NSA) that remains efficient when each GQA group contains only a few query heads, broadening NSA's applicability across LLM architectures.
Findings
Up to 3.5x kernel latency reduction
Up to 1.25x end-to-end training speedup
Up to 1.36x prefill-phase speedup
Abstract
Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers a substantial system-level performance boost while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA forces a loop order that is only efficient with a relatively large number of query heads in each Grouped Query Attention (GQA) group, whereas existing LLMs widely adopt a much smaller number of query heads in each GQA group. This inconsistency significantly limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), an alternative kernel implementation that enables…
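To make the dependence on GQA group size concrete, here is a rough back-of-the-envelope sketch. It assumes that a kernel batching the query heads of one GQA group must pad that batch up to a minimum matrix-multiply tile height. The value of 16 rows and the function name `padded_utilization` are illustrative assumptions, not figures or identifiers taken from the paper.

```python
import math

def padded_utilization(heads_per_group: int, min_tile_rows: int = 16) -> float:
    """Fraction of matmul rows doing useful work after padding.

    min_tile_rows is a hypothetical hardware tile height (assumed, not
    from the paper). The query heads of one GQA group are padded up to
    the next multiple of that height.
    """
    padded_rows = math.ceil(heads_per_group / min_tile_rows) * min_tile_rows
    return heads_per_group / padded_rows

# Under the assumed tile height, small groups waste most of the tile.
for g in (1, 2, 4, 8, 16):
    print(f"g={g:2d}  useful fraction={padded_utilization(g):.4f}")
```

Under this assumption, a group of 4 query heads keeps only a quarter of the padded rows busy, which is the kind of waste a different loop order would aim to avoid.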
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well motivated. It identifies padding-driven inefficiency in NSA for common small-g GQA regimes and addresses it directly via loop reordering. 2. The paper presents a solid kernel design with detailed techniques such as non-contiguous query batching with early termination; decoupled reduction to avoid atomics; precomputed online softmax stats to maintain numerical correctness. 3. The paper provides comprehensive evaluation with microbenchmarks across GPUs and (BK,T) settings, plu
1. FSA only provides efficiency gains over NSA when each GQA group has few query heads, which limits its impact.
1. This paper aims to address an interesting and important problem in long-context LLM applications. 2. The presentation is good with clear writing. 3. The evaluation results are comprehensive and good.
1. What precision is used for evaluation? Is it FP8, FP16, or FP32? 2. I am wondering how the proposed kernel could scale beyond a 64K length, especially for inference. 3. I am wondering if the authors could provide any further insights into optimizing the proposed kernel for different GPU architectures, such as Hopper and Blackwell. 4. Some system-related works on sparse attention are missing [1-3]. [1] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cac
1. Problem Significance: Addresses the key bottleneck in deploying sparse attention: the incompatibility between native NSA kernels and mainstream LLMs' small GQA group sizes (typically 1-4 query heads). 2. Core Innovation: Inverts the NSA kernel's loop order (to "outer loop over KV blocks, inner loop over query tokens"), eliminating padding requirements for small GQA groups and significantly reducing redundant computation and memory access. Enhanced by memory management and specialized kernels.
1. Limited Generalization to Extreme Lengths: Evaluation only up to 64K tokens, lacking analysis of ultra-long contexts (128K/256K). Attention sink effects on FSA's dual-buffer design remain unexamined at extreme scales. 2. Missing SOTA Comparisons: Only compares FSA with vanilla NSA and full attention. Omits comparisons with recent sparse kernels like flashdecoding, limiting perspective on performance trade-offs. 3. Insufficient Accuracy Analysis: Accuracy claims rely solely on Llama3-8B result
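The loop reordering the reviews describe can be illustrated with a small NumPy sketch. It is a single-head, non-causal toy and not the authors' kernel: the function names, the `sel` block-selection format, and the two-pass structure are assumptions made for illustration. One function loops over query tokens first, the other loops over KV blocks first and relies on softmax statistics computed beforehand, so each block's contribution can be added independently.

```python
import numpy as np

def sparse_attn_query_outer(q, k, v, sel, block):
    """Outer loop over query tokens, inner loop over their selected KV blocks."""
    T, d = q.shape
    out = np.zeros((T, v.shape[1]))
    for t in range(T):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel[t]])
        s = q[t] @ k[idx].T / np.sqrt(d)
        p = np.exp(s - s.max())
        out[t] = (p / p.sum()) @ v[idx]
    return out

def sparse_attn_kv_outer(q, k, v, sel, block):
    """Outer loop over KV blocks, inner batch of the queries that selected each block."""
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)
    # Pass 1: per-query softmax statistics (row max m and normalizer l).
    m = np.empty(T)
    l = np.empty(T)
    for t in range(T):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in sel[t]])
        s = q[t] @ k[idx].T * scale
        m[t] = s.max()
        l[t] = np.exp(s - m[t]).sum()
    # Pass 2: each KV block is loaded once and applied to a
    # (possibly non-contiguous) batch of query rows.
    out = np.zeros((T, v.shape[1]))
    for b in range(k.shape[0] // block):
        rows = [t for t in range(T) if b in sel[t]]
        if not rows:
            continue  # no query selected this block
        kb = k[b * block:(b + 1) * block]
        vb = v[b * block:(b + 1) * block]
        s = q[rows] @ kb.T * scale
        p = np.exp(s - m[rows][:, None]) / l[rows][:, None]
        out[rows] += p @ vb  # rows are unique, so in-place add is safe
    return out

rng = np.random.default_rng(0)
T, S, d, block = 8, 32, 4, 4
q = rng.standard_normal((T, d))
k = rng.standard_normal((S, d))
v = rng.standard_normal((S, d))
# Each query picks 3 distinct KV blocks (hypothetical selection).
sel = [rng.choice(S // block, size=3, replace=False).tolist() for _ in range(T)]
a = sparse_attn_query_outer(q, k, v, sel, block)
b = sparse_attn_kv_outer(q, k, v, sel, block)
print(np.allclose(a, b))
```

In the KV-outer version the inner matrix multiply is batched over query tokens rather than over the query heads of one GQA group, so its height no longer depends on the group size. A real GPU kernel would also need to handle causality, multiple heads, and how partial results are reduced, none of which this toy models.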
