TL;DR
This paper proposes hybrid sparse attention methods with learnable token eviction to address forgetfulness in linear attention models, improving performance on retrieval-intensive tasks while maintaining efficiency.
Contribution
It introduces a novel learnable token eviction mechanism combined with sliding-window attention, enhancing linear attention models' ability to retain critical information.
Findings
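Improves retrieval performance on synthetic (S-NIAH) and realistic (EVAPORATE) benchmarks.
Preserves linear attention's constant time and space complexity when learnable token eviction is combined with sliding-window attention.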
Provides efficient GPU kernels for sparse attention.
Abstract
Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton…
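As a rough illustration of the retention idea described in the abstract, the Python sketch below scores each token from its past and future neighbors with a small centered 1D convolution, then lets a query attend to a sliding window of recent tokens plus a fixed budget of retained older tokens. Everything in it is an assumption made for illustration: the function names `conv1d_scores` and `visible_keys`, the scalar per-token feature, the fixed kernel weights (the paper trains its CNN end to end), the top-`budget` selection rule, and the attention-sink handling are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Set


def conv1d_scores(features: List[float], kernel: List[float]) -> List[float]:
    """Centered, zero-padded 1D convolution over per-token features.

    Each token's score mixes features from past AND future neighbors,
    which is possible because tokens are scored only after they have
    spent some time inside the sliding window. Kernel length must be odd.
    """
    assert len(kernel) % 2 == 1, "kernel length must be odd"
    half = len(kernel) // 2
    n = len(features)
    scores = []
    for t in range(n):
        s = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < n:  # zero padding outside the sequence
                s += w * features[idx]
        scores.append(s)
    return scores


def visible_keys(t: int, scores: List[float], window: int,
                 budget: int, sink: int = 1) -> Set[int]:
    """Key positions that a query at position t may attend to (one head).

    Assumed rule: keep the last `window` tokens, the first `sink` tokens,
    and the `budget` highest-scoring tokens among everything older.
    """
    recent = set(range(max(0, t - window + 1), t + 1))
    sinks = set(range(min(sink, t + 1)))
    evictable = [i for i in range(t + 1) if i not in recent and i not in sinks]
    kept = sorted(evictable, key=lambda i: scores[i], reverse=True)[:budget]
    return recent | sinks | set(kept)
```

With features `[0, 0, 5, 0, 0, 0, 0, 0]`, kernel `[0.25, 0.5, 0.25]`, `window=2`, `budget=1`, and `sink=1`, a query at position 7 sees keys `{0, 2, 6, 7}`. The visible set never exceeds `window + budget + sink` entries, which is what keeps the per-step cost independent of sequence length in this sketch.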
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Per-head, per-token scoring from short local context with grouped 1D convs is simple, parallel, and adds ~1% params. The design offers a constant budget and predictable latency.
2. Results are reported on both synthetic (S-NIAH) and realistic (EVAPORATE) retrieval benchmarks, showing consistent gains over pure GDN/GDN+SWA in many settings.
1. The novelty is limited. The laNSA component is adopted from prior NSA work, and the overall recipe seems to be an alternation of existing hybrid attention rather than a fundamentally new mechanism. The idea of LTE also sits close to the broader family of token-eviction methods, making the novelty feel incremental.
2. The evidence does not decisively beat common practice. The improvements on EVAPORATE are modest averages (e.g., laLTE/laNSA only several points over GDN/GDN+SWA), while interleaving fu
### Novelty: This work proposes two complementary mixers interleaved with linear attention—NSA for query-aware sparse access over the full past and LTE for learned keep/evict under a strict cache budget; introduces per-token, per-head retention via a tiny 1D-CNN with SWA-enabled look-ahead and an attention sink to maintain near-constant KV memory; provides deployment-minded decoding/KV design (two-segment cache, lazy batched scoring) and frames a clear accuracy–efficiency Pareto frontier (laLTE
### Scale and generality: Results are limited to 0.4B/1.4B. It is unclear whether the trends hold for larger, modern LLM families (e.g., Qwen2.5/3, DeepSeek) or for multilingual/code models.
### Benchmark breadth: The evaluation focuses on long-context retrieval. Broader benchmarks commonly used today (e.g., instruction following, math, and code such as AlpacaEval, GSM8K, HumanEval) are absent, making it hard to gauge side effects beyond retrieval.
### Efficiency reporting: The paper argues
- The method is simple to implement.
## Lack of novelty
The method is somewhat akin to combining existing elements.
## Lack of efficiency analysis
Even if the method is claimed to have linear complexity, linear complexity does not always mean faster or more efficient than FlashAttention. The authors do not provide a critical analysis of efficiency on real-world hardware. No latency measurements (in seconds) were reported. Latency analysis is especially crucial for the CNN, since small Conv operations are known to be slower than nor
- Sparse attention is an important topic for long-context models, as the attention operation is quadratic.
- The authors utilize many recent works as a motivation and backbone for their method.
- L165: The probing step remains quadratic due to the constant setting of the block size $M$. As this is part of the attention computation as a precursor operation, I don't think you can claim that the attention computation is a constant $MK$.
- How can the method presented in figure 2 ever learn to retain a token in a simple task such as needle-in-a-haystack? For example, datasets like RULER have some tasks where the retrieval target can be any number of key value pairs which is not known until t
Code & Models
- 🤗 Idiap/gated-deltanet-attn-0.4B-10B · model · 1 dl
- 🤗 Idiap/gated-deltanet-attn-1.4B-30B · model · 32 dl · ♡ 1
- 🤗 Idiap/gated-deltanet-lte-0.4B-10B · model · 24 dl
- 🤗 Idiap/gated-deltanet-lte-1.4B-30B · model · 5 dl
- 🤗 Idiap/gated-deltanet-nsa-0.4B-10B · model
- 🤗 Idiap/gated-deltanet-nsa-1.4B-30B · model · 1 dl
- 🤗 Idiap/gated-deltanet-swa-0.4B-10B · model · 1 dl
- 🤗 Idiap/gated-deltanet-swa-1.4B-30B · model
