Efficient Attention Mechanisms for Large Language Models: A Survey

Yutao Sun; Zhenyu Li; Yike Zhang; Tengyu Pan; Bowen Dong; Yuyi Guo; Jianyong Wang

arXiv:2507.19595 · cs.CL · February 10, 2026

TL;DR

This survey reviews recent advancements in efficient attention mechanisms for large language models, focusing on linear and sparse methods to reduce computational complexity while maintaining performance.

Contribution

It provides a comprehensive overview of algorithmic and hardware innovations in efficient attention, including integration into large-scale pre-trained models and hybrid architectures.

Findings

01 Linear attention achieves scalable inference through kernel approximations and fast-weight dynamics.

02 Sparse attention improves efficiency by restricting token interactions to fixed or learned patterns.

03 The survey connects theoretical foundations to practical deployment strategies for efficient language models.
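As a rough illustration of the fixed-pattern case in the second finding, the sketch below builds a causal sliding-window mask and applies it inside ordinary softmax attention. The function names `sliding_window_mask` and `masked_softmax_attention` and the `window` parameter are hypothetical, chosen for this example and not taken from the survey. The code is written for clarity and still materializes the full score matrix, so it shows the sparsity pattern, not the speedup.

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean (n, n) mask: token i may attend to token j when i - window < j <= i."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def masked_softmax_attention(q, k, v, mask):
    """Softmax attention restricted to the positions where mask is True.

    Assumes every row of mask has at least one True entry (true for the
    sliding-window mask, which always keeps the diagonal).
    """
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)             # drop disallowed pairs
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                             # exp(-inf) == 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 8, 4))  # three (8, 4) arrays
    out = masked_softmax_attention(q, k, v, sliding_window_mask(8, window=3))
    print(out.shape)
```

Each row of the mask keeps at most `window` entries, so an implementation that only computes the kept entries would do work proportional to `n * window` instead of `n * n`.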

Abstract

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address this limitation, recent research has introduced two principal categories of efficient attention mechanisms. Linear attention methods achieve linear complexity through kernel approximations, recurrent formulations, or fast-weight dynamics, thereby enabling scalable inference with reduced computational overhead. Sparse attention techniques, in contrast, limit attention computation to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies, enhancing efficiency while preserving contextual coverage. This survey provides a systematic and comprehensive overview of these developments, integrating both algorithmic…
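To make the link between kernel approximations and recurrent formulations concrete, here is a minimal NumPy sketch of causal kernelized linear attention computed two ways: once with an explicit n-by-n score matrix, and once with a running state updated token by token. The feature map `elu(x) + 1` is one common choice and is an assumption here, as are all function names; the abstract does not commit to a specific kernel, and this is not presented as the survey's own formulation.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a strictly positive feature map (an assumed, commonly used choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_quadratic(q, k, v):
    """Reference form: build the full (n, n) kernel score matrix, then mask causally."""
    scores = np.tril(feature_map(q) @ feature_map(k).T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(q, k, v):
    """Same output from a fixed-size running state, linear in sequence length."""
    n, d = q.shape
    d_v = v.shape[1]
    state = np.zeros((d, d_v))   # running sum of outer(phi(k_t), v_t)
    normalizer = np.zeros(d)     # running sum of phi(k_t)
    out = np.empty((n, d_v))
    for t in range(n):
        phi_k = feature_map(k[t])
        state += np.outer(phi_k, v[t])
        normalizer += phi_k
        phi_q = feature_map(q[t])
        out[t] = (phi_q @ state) / (phi_q @ normalizer)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = rng.standard_normal((3, 6, 4))  # three (6, 4) arrays
    a = causal_linear_attention_quadratic(q, k, v)
    b = causal_linear_attention_recurrent(q, k, v)
    print(np.allclose(a, b))
```

The recurrent version keeps only a `d x d_v` matrix and a length-`d` vector between steps, which is why inference cost per token does not grow with context length under this formulation.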
