Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Mingkuan Zhao; Wentao Hu; Jiayin Wang; Xin Lai; Tianchen Huang; Yuheng Min; Rui Yan; Xiaoyan Zhu

arXiv:2511.09596 · cs.LG · December 1, 2025



TL;DR

This paper introduces SPAttention, a sparse attention mechanism that reorganizes multi-head attention so that the heads share a single partitioned computation. This reduces attention complexity from O(H N^2) to O(N^2), where N is the context size and H is the number of heads, without sacrificing performance.

Contribution

The paper proposes Principled Structural Sparsity in attention, transforming multi-head attention into a collaborative, single computation, improving efficiency and model performance.

Findings

01 Reduces attention complexity from H*N^2 to N^2
02 Enhances computational efficiency without performance loss
03 Encourages head specialization for better dependency modeling

Abstract

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of O(H N^2) that grows quadratically with the context size (N) and linearly with the number of heads (H). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping…
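The counting argument behind the H*N^2 to N^2 claim can be illustrated with a small sketch. This is not the paper's algorithm: it assumes causal attention and a hypothetical scheme in which each head owns one contiguous band of query-key distances, with band boundaries (`band_edges`) chosen so the heads receive roughly equal numbers of pairs. The function names and the banding rule are illustrative assumptions only.

```python
import numpy as np

def band_edges(n, num_heads):
    """Hypothetical split of causal distances 0..n-1 into contiguous bands.

    Distance d = i - j occurs for n - d causal (query, key) pairs, so the
    cut points are placed on the cumulative pair count to keep the bands
    roughly (not exactly) balanced.
    """
    pairs_per_distance = n - np.arange(n)
    cumulative = np.cumsum(pairs_per_distance)
    targets = cumulative[-1] * np.arange(1, num_heads) / num_heads
    cuts = np.searchsorted(cumulative, targets) + 1
    return np.concatenate(([0], cuts, [n]))

def head_masks(n, num_heads):
    """Boolean masks of shape (num_heads, n, n); True where a head computes a score."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = i - j  # negative above the diagonal, i.e. non-causal pairs
    edges = band_edges(n, num_heads)
    return np.stack([(dist >= lo) & (dist < hi)
                     for lo, hi in zip(edges[:-1], edges[1:])])

n, num_heads = 8, 4
masks = head_masks(n, num_heads)

# Dense multi-head attention scores every causal pair once per head.
dense_scores = num_heads * n * (n + 1) // 2
# With a non-overlapping partition, every causal pair is scored exactly once.
partitioned_scores = int(masks.sum())

print(dense_scores, partitioned_scores)  # 144 36
```

Because the masks are disjoint and together cover the full causal triangle, the total number of score computations is independent of the number of heads, which is the sense in which the head factor H drops out of the cost.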

Taxonomy

Topics: Topic Modeling · Big Data and Digital Economy · Multimodal Machine Learning Applications