Long-Context Generalization with Sparse Attention

Pavlo Vasylenko; Hugo Pitorro; André F. T. Martins; Marcos Treviso

arXiv:2506.16640 · cs.CL · March 3, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces ASEntmax, a learnable sparse attention mechanism for transformers that improves long-context generalization by focusing on relevant tokens, outperforming traditional softmax-based attention in various tasks.

Contribution

The paper proposes Adaptive-Scalable Entmax, a novel attention method with a learnable parameter, enabling dynamic sparsity and better long-range dependency modeling in transformers.

Findings

01

ASEntmax outperforms softmax and fixed $\alpha$-entmax on synthetic tasks.

02

Achieves up to 1000× length extrapolation on benchmarks.

03

Improves long-context language modeling with better perplexity and retrieval accuracy.

Abstract

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on…
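The contrast the abstract draws between dense softmax and sparse $\alpha$-entmax can be illustrated with a small sketch. It uses the standard $\alpha$-entmax definition, $p_i = [(\alpha-1) z_i - \tau]_+^{1/(\alpha-1)}$ with $\tau$ chosen so the weights sum to one, solved here by plain bisection. The function names, the toy scores, and the scalar `temperature` argument are illustrative assumptions; the excerpt does not specify how ASEntmax parameterizes or learns its temperature, so this should not be read as the authors' implementation.

```python
import math


def softmax(scores, temperature=1.0):
    """Standard softmax; every token receives a strictly positive weight."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entmax_bisect(scores, alpha=1.5, temperature=1.0, n_iter=60):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau.

    `temperature` is a hypothetical fixed scalar used only for illustration;
    larger values flatten the scores and make the output denser.
    """
    z = [(alpha - 1.0) * s / temperature for s in scores]
    z_max = max(z)
    exponent = 1.0 / (alpha - 1.0)
    # The total mass is decreasing in tau, and the root lies in this bracket.
    lo = z_max - 1.0
    hi = z_max - len(z) ** (1.0 - alpha)
    tau = lo
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        mass = sum(max(v - tau, 0.0) ** exponent for v in z)
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    weights = [max(v - tau, 0.0) ** exponent for v in z]
    total = sum(weights)
    # Renormalize to absorb the small residual error of the bisection.
    return [w / total for w in weights]


if __name__ == "__main__":
    # Toy setting: one relevant token (score 4.0) among n - 1 distractors (score 0.0).
    for n in (10, 1000, 10000):
        scores = [4.0] + [0.0] * (n - 1)
        p_soft = softmax(scores)
        p_ent = entmax_bisect(scores, alpha=1.5)
        zeros = sum(1 for w in p_ent if w == 0.0)
        print(f"n={n}: softmax weight={p_soft[0]:.4f}, "
              f"entmax weight={p_ent[0]:.4f}, exact zeros={zeros}")
```

In this toy setting the softmax weight on the relevant token shrinks as distractors are added, while 1.5-entmax keeps its full weight on that token and assigns exact zeros to the rest. Calling `entmax_bisect([4.0, 0.0, 0.0, 0.0], temperature=8.0)` instead returns all-positive weights, which is the dense, softmax-like regime that a temperature makes reachable.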

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

* The problem of length generalization of Transformers is well-motivated and relevant.
* The paper is thorough and rigorous in math, although I didn’t check all proofs and derivations.
* The conclusions, derived from principled theoretical analysis, are also intuitively plausible and corroborated by evidence.
* The empirical validation is convincing, and the results are promising. Specifically, Entmax and ASEntmax show superior out-of-distribution generalization capabilities as compared to So

Weaknesses

* It would be helpful to also compare different attention activation functions for language modeling perplexity as in Table 3, but with industry-standard RoPE embeddings. Appendix H.9. does not provide such comparisons, but it seems that the RoPE models are already pre-trained, and you would only need to test their perplexity on these long-context benchmarks.
* The paper could be more self-contained. To fully comprehend the material, it is necessary to familiarize oneself with several previous

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- A very simple change to the existing attention mechanism.
- Works well across different datasets and tasks, showing consistent length generalization compared to baselines.

Weaknesses

- There are some points I had trouble understanding; refer to the questions.

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper cleanly connects attention dispersion to representation collapse and over-compression with precise definitions and propositions.
2. The method plays nicely with NAPE and existing fast-attention kernels like FlashAttention/AdaSplash, which makes it engineer-friendly.

Overall, I think it is solid, but I'm not fully sure my evaluation is correct.

Weaknesses

1. Entropy bounds rely on bounded-logit or near-Gaussian assumptions that may break on heavy-tailed or high-contrast distributions.
2. Deployment-critical profiles for memory, throughput, and tail latency are thin, especially for large models and ultra-long inputs.
3. Interplay with RoPE extensions, retrieval-style sparsity, or SSM hybrids is discussed but not systematically tested.
4. Robustness beyond NAPE and specific hyperparameters is unclear, so portability to other position encodings o

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Image Retrieval and Classification Techniques · Time Series Analysis and Forecasting

Methods: Softmax · Focus