Long-Context Generalization with Sparse Attention
Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

TL;DR
This paper introduces ASEntmax, a learnable sparse attention mechanism for transformers that improves long-context generalization by focusing on relevant tokens, outperforming traditional softmax-based attention in various tasks.
Contribution
The paper proposes Adaptive-Scalable Entmax (ASEntmax), a novel attention method that adds a learnable temperature parameter to $\alpha$-entmax, enabling dynamic sparsity and better long-range dependency modeling in transformers.
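To illustrate how a temperature can move a sparse attention distribution between sparse and dense regimes, here is a minimal sketch using sparsemax (the α = 2 case of α-entmax) with a fixed, hand-picked temperature. The scores, the `support_size` helper, and the temperature values are assumptions chosen for illustration; the paper's learnable parameterization is not reproduced here.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (alpha-entmax with alpha = 2) for a 1-D score vector."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]              # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # True for the first k_max entries
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max    # threshold so the output sums to 1
    return np.maximum(z - tau, 0.0)

def support_size(scores, temperature):
    """Hypothetical helper: count tokens with nonzero attention weight."""
    weights = sparsemax(np.asarray(scores, dtype=float) / temperature)
    return int(np.count_nonzero(weights))

# Illustrative scores (not from the paper).
scores = np.array([2.0, 1.5, 1.0, 0.5, 0.0])
for temperature in (0.5, 1.0, 10.0):
    print(temperature, support_size(scores, temperature))
```

With these toy scores, a low temperature concentrates all weight on a single token, while a high temperature spreads nonzero weight over every token. A learned temperature would, in principle, let each attention head choose where to sit between those two extremes.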
Findings
ASEntmax outperforms softmax and fixed $\alpha$-entmax in synthetic tasks.
Achieves up to 1000× length extrapolation on benchmarks.
Improves long-context language modeling with better perplexity and retrieval accuracy.
Abstract
Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on…
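The dispersion effect described in the abstract can be seen in a toy setting. The sketch below compares softmax with α-entmax (α = 1.5, computed by bisection on the threshold) on a sequence containing one high-scoring token and a growing number of zero-scoring tokens. The score gap of 4.0, the sequence lengths, and the `signal_weight` helper are assumptions made for illustration and do not come from the paper's experiments.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax (alpha > 1) for a 1-D vector, via bisection on the threshold tau."""
    z = np.asarray(z, dtype=float) * (alpha - 1.0)
    exponent = 1.0 / (alpha - 1.0)
    lo = z.max() - 1.0                       # total mass at lo is >= 1
    hi = z.max() - z.size ** (1.0 - alpha)   # total mass at hi is <= 1
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        mass = np.sum(np.maximum(z - mid, 0.0) ** exponent)
        if mass >= 1.0:
            lo = mid
        else:
            hi = mid
    p = np.maximum(z - lo, 0.0) ** exponent
    return p / p.sum()                       # renormalize away residual bisection error

def signal_weight(attn_fn, n, signal_score=4.0):
    """Hypothetical setup: one relevant token (index 0) among n - 1 zero-score tokens."""
    scores = np.zeros(n)
    scores[0] = signal_score
    return attn_fn(scores)

for n in (10, 1_000, 100_000):
    dense = signal_weight(softmax, n)
    sparse = signal_weight(entmax_bisect, n)
    print(n, dense[0], sparse[0], np.count_nonzero(sparse))
```

In this setup the softmax weight on the relevant token equals e^4 / (e^4 + n - 1), which shrinks toward zero as n grows, whereas 1.5-entmax keeps the irrelevant tokens at exactly zero. That holds here because the score gap is large enough for the threshold to exclude every zero-score token; with smaller gaps entmax would also spread mass across them.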
Peer Reviews
Decision: ICLR 2026 Poster
* The problem of length generalization of Transformers is well-motivated and relevant.
* The paper is thorough and rigorous in math, although I didn’t check all proofs and derivations.
* The conclusions, derived from principled theoretical analysis, are also intuitively plausible and corroborated by evidence.
* The empirical validation is convincing, and the results are promising. Specifically, Entmax and ASEntmax show superior out-of-distribution generalization capabilities as compared to So
* It would be helpful to also compare different attention activation functions for language modeling perplexity as in Table 3, but with industry-standard RoPE embeddings. Appendix H.9. does not provide such comparisons, but it seems that the RoPE models are already pre-trained, and you would only need to test their perplexity on these long-context benchmarks.
* The paper could be more self-contained. To fully comprehend the material, it is necessary to familiarize oneself with several previous
- a very simple change to the existing attention mechanism
- works well across different datasets and tasks, showing consistent length generalization compared to baselines
- there are some parts I had trouble understanding; refer to the questions
1. The paper cleanly connects attention dispersion to representation collapse and over-compression with precise definitions and propositions.
2. The method plays nicely with NAPE and existing fast-attention kernels like FlashAttention/AdaSplash, which makes it engineer-friendly.
Overall, I think it is solid, but I'm not fully sure my evaluation is correct.
1. Entropy bounds rely on bounded-logit or near-Gaussian assumptions that may break on heavy-tailed or high-contrast distributions.
2. Deployment-critical profiles for memory, throughput, and tail latency are thin, especially for large models and ultra-long inputs.
3. Interplay with RoPE extensions, retrieval-style sparsity, or SSM hybrids is discussed but not systematically tested.
4. Robustness beyond NAPE and specific hyperparameters is unclear, so portability to other position encodings o
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Image Retrieval and Classification Techniques · Time Series Analysis and Forecasting
Methods: Softmax · Focus
