Scaling Context Requires Rethinking Attention

Carles Gelada; Jacob Buckman; Sean Zhang; Txus Bach

arXiv:2507.04239·cs.LG·July 8, 2025

TL;DR

This paper introduces power attention, a new linear-cost sequence modeling layer that overcomes limitations of existing attention mechanisms at long sequence lengths, enabling efficient and effective in-context learning.

Contribution

The authors propose power attention, a novel architectural layer with adjustable state size, and provide optimized GPU kernels to improve long-context sequence modeling performance.

Findings

01. Power attention outperforms exponential and linear attention in long-context in-context learning.

02. Efficient GPU kernels enable scalable deployment of power attention.

03. Power attention maintains effectiveness without increasing computational costs.

Abstract

We argue that neither transformers nor sub-quadratic architectures are well suited to training at long sequence lengths: processing the context is too expensive in the former, too inexpensive in the latter. Approaches such as sliding window attention, which reduce the cost-per-token of a transformer, impair in-context learning, and so are also unsuitable. To address these limitations, we introduce power attention, an architectural layer for linear-cost sequence modeling whose state size can be adjusted independently of parameters, unlocking the advantages of linear attention on practical domains. We develop and open-source a set of GPU kernels for efficient power attention, identifying a novel pattern of operation fusion to avoid memory and bandwidth bottlenecks. Our experiments on the in-context learning of power attention show that these models dominate both exponential…
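A minimal sketch of how a linear-cost layer with an adjustable state size could work. The excerpt above does not give the layer's formula, so this assumes unnormalized causal attention weights of the form (q · k)^p and uses the identity (q · k)^p = <phi(q), phi(k)>, where phi is the flattened p-fold tensor power. The function names, the integer degree p, and the omission of normalization are illustrative assumptions, not the authors' kernels or API.

```python
import numpy as np

def tensor_power(x, p):
    """Flattened p-fold tensor power, so that phi(a) @ phi(b) == (a @ b) ** p."""
    out = np.ones(1)
    for _ in range(p):
        out = np.outer(out, x).ravel()
    return out

def power_attention_quadratic(Q, K, V, p):
    """Assumed form: causal, unnormalized weights (q_i . k_j) ** p. O(T^2) in length T."""
    T = Q.shape[0]
    weights = (Q @ K.T) ** p
    causal_mask = np.tril(np.ones((T, T)))  # position i attends to j <= i
    return (weights * causal_mask) @ V

def power_attention_recurrent(Q, K, V, p):
    """Same outputs from a running state of shape (d ** p, d_v). O(T) in length T."""
    T, d = Q.shape
    state = np.zeros((d ** p, V.shape[1]))
    out = np.zeros(V.shape, dtype=float)
    for t in range(T):
        # Fold the current key/value pair into the fixed-size state.
        state += np.outer(tensor_power(K[t], p), V[t])
        # Read the state with the expanded query.
        out[t] = tensor_power(Q[t], p) @ state
    return out

# Hypothetical toy sizes, chosen only to compare the two forms.
rng = np.random.default_rng(0)
T, d, d_v, p = 6, 3, 4, 2
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d_v))
same = np.allclose(power_attention_quadratic(Q, K, V, p),
                   power_attention_recurrent(Q, K, V, p))
```

Under these assumptions the state holds d ** p rows, so raising p enlarges the state without adding any learned parameters, while the per-token cost stays independent of sequence length. This naive expansion stores redundant entries and is meant only to illustrate the idea.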
