Scaling Attention via Feature Sparsity

Yan Xie; Tiansheng Wen; Tangda Huang; Bo Chen; Chenyu You; Stefanie Jegelka; Yifei Wang

arXiv:2603.22300 · cs.LG · March 31, 2026

TL;DR

This paper introduces Sparse Feature Attention (SFA), a novel method that leverages feature sparsity to efficiently scale Transformers to longer contexts while maintaining accuracy.

Contribution

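The authors propose SFA, which represents queries and keys as $k$-sparse codes, and FlashSFA, an IO-aware kernel that extends FlashAttention to operate on sparse overlaps without materializing dense score matrices. Together they keep attention accurate while improving speed and reducing FLOPs relative to dense baselines.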

Findings

01. SFA matches dense baseline accuracy in GPT-2 and Qwen3 pretraining.

02. SFA achieves up to 2.5x speedup and nearly 50% reduction in FLOPs and KV-cache.

03. SFA preserves retrieval accuracy and robustness at long contexts.

Abstract

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^{2} d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these approaches consistently degrade accuracy. In this paper, we instead explore an orthogonal axis: feature sparsity. We propose Sparse Feature Attention (SFA), where queries and keys are represented as $k$-sparse codes that preserve high-dimensional expressivity while reducing the cost of attention from $\Theta(n^{2} d)$ to $\Theta(n^{2} k^{2} / d)$. To make this efficient at scale, we introduce FlashSFA, an IO-aware kernel that extends FlashAttention to operate directly on sparse overlaps without materializing dense score matrices. Across GPT-2 and Qwen3 pretraining, SFA matches dense baselines while improving speed by up to $2.5\times$ and reducing FLOPs and…
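To make the idea of $k$-sparse queries and keys concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, selecting the top-$k$ entries by magnitude is an assumed sparsification rule, and the $1/\sqrt{d}$ scaling is borrowed from standard attention. It only shows that a score between two sparse codes depends solely on the feature indices they share.

```python
import math

def topk_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec as an {index: value} dict.
    (Assumed sparsification rule; the paper's exact rule may differ.)"""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in idx}

def sparse_dot(a, b):
    """Dot product over the feature indices that both sparse codes share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller code
    return sum(v * b[i] for i, v in a.items() if i in b)

def sparse_feature_attention(Q, K, V, k):
    """Softmax attention (no masking) with k-sparse queries and keys.
    Q, K: lists of length-d vectors; V: list of value vectors."""
    d = len(Q[0])
    Qs = [topk_sparsify(q, k) for q in Q]
    Ks = [topk_sparsify(x, k) for x in K]
    out = []
    for q in Qs:
        # Scores only involve overlapping nonzero features.
        scores = [sparse_dot(q, ks) / math.sqrt(d) for ks in Ks]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out
```

If the $k$ nonzero indices of a query and a key were spread uniformly over $d$ features, the expected number of shared indices would be $k^{2}/d$, which is where the $n^{2} k^{2} / d$ term in the abstract comes from. Note that this sketch still spends $O(k)$ hash lookups per query-key pair and materializes every score row; reaching overlap-only cost would need an index-based layout and a fused kernel, which FlashSFA is described as providing and which this sketch does not attempt.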

Code & Models

Repositories

YannX1e/Sparse-Feature-Attention (GitHub)

Videos

Scaling Attention via Feature Sparsity (SlidesLive)