Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Mutian He; Philip N. Garner

arXiv:2510.20787·cs.CL·October 27, 2025

TL;DR

This paper proposes hybrid sparse attention methods with learnable token eviction to address forgetfulness in linear attention models, improving performance on retrieval-intensive tasks while maintaining efficiency.

Contribution

It introduces a novel learnable token eviction mechanism combined with sliding-window attention, enhancing linear attention models' ability to retain critical information.

Findings

1. Improved retrieval performance on benchmarks.
2. Maintains linear time and space complexity.
3. Provides efficient GPU kernels for sparse attention.

Abstract

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate the issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave token mixers with intermediate time and space complexity between linear and full attention, including sparse attention with token eviction, and the query-aware native sparse attention. Particularly, we propose a novel learnable token eviction approach. Combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV-pairs per head, maintaining linear attention's constant time and space complexity. Efficient Triton…
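The retention idea in the abstract (a sliding window plus a small, fixed set of older KV-pairs per head) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes one scalar feature per token in place of hidden states, a hand-picked centered kernel in place of the trained CNN, and top-k selection by score as the eviction rule. The names `conv_scores`, `visible_positions`, `window`, and `budget` are hypothetical.

```python
def conv_scores(features, kernel):
    """Score each token from a centered window of neighbours (zero-padded).

    Assumption: a fixed kernel stands in for the paper's trainable CNN, and
    each token is reduced to one scalar feature for a single head.
    """
    half = len(kernel) // 2
    n = len(features)
    scores = []
    for t in range(n):
        total = 0.0
        for j, w in enumerate(kernel):
            i = t + j - half  # covers past (i < t) and future (i > t) neighbours
            if 0 <= i < n:
                total += w * features[i]
        scores.append(total)
    return scores


def visible_positions(scores, step, window, budget):
    """Positions a query at `step` may attend to for one head.

    Assumption: tokens inside the sliding window are always kept; among older
    tokens, only the `budget` highest-scoring ones survive eviction.
    """
    window_start = max(0, step - window + 1)
    recent = list(range(window_start, step + 1))
    older = range(0, window_start)
    kept = sorted(older, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(kept) + recent


features = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1, 0.3, 0.2]  # toy per-token features
kernel = [0.25, 0.5, 0.25]                            # toy centered kernel
scores = conv_scores(features, kernel)
positions = visible_positions(scores, step=7, window=3, budget=2)
print(positions)  # [1, 4, 5, 6, 7]
```

In this sketch the number of visible positions never exceeds `window + budget`, whatever the sequence length. That bound is what keeps the per-step cost constant.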

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

1. Per-head, per-token scoring from short local context with grouped 1D convs is simple, parallel, and adds ~1% params. The design has a constant budget and predictable latency.
2. Results are reported on both synthetic (S-NIAH) and realistic (EVAPORATE) retrieval benchmarks, showing consistent gains over pure GDN/GDN+SWA in many settings.

Weaknesses

1. The novelty is limited. The laNSA component is adopted from prior NSA work, and the overall recipe seems to be an alternation of existing hybrid attention rather than a fundamentally new mechanism. The idea of LTE also sits close to the broader family of token-eviction methods, making the novelty feel incremental.
2. The evidence does not decisively beat common practice. The improvements on EVAPORATE are modest averages (e.g., laLTE/laNSA only several points over GDN/GDN+SWA), while interleaving fu

Reviewer 02·Rating 6·Confidence 3

Strengths

### Novelty

This work proposes two complementary mixers interleaved with linear attention—NSA for query-aware sparse access over the full past and LTE for learned keep/evict under a strict cache budget; introduces per-token, per-head retention via a tiny 1D-CNN with SWA-enabled look-ahead and an attention sink to maintain near-constant KV memory; provides deployment-minded decoding/KV design (two-segment cache, lazy batched scoring) and frames a clear accuracy–efficiency Pareto frontier (laLTE

Weaknesses

### Scale and generality

Results are limited to 0.4B/1.4B. It is unclear whether the trends hold for larger, modern LLM families (e.g., Qwen2.5/3, DeepSeek) or for multilingual/code models.

### Benchmark breadth

The evaluation focuses on long-context retrieval. Broader benchmarks commonly used today (e.g., instruction following, math, and code such as AlpacaEval, GSM8K, HumanEval) are absent, making it hard to gauge side effects beyond retrieval.

### Efficiency reporting

The paper argues

Reviewer 03·Rating 2·Confidence 4

Strengths

- The method is simple to implement.

Weaknesses

## Lack of novelty

The method is somewhat akin to combining existing elements.

## Lack of efficiency analysis

Even if the method is claimed to have linear complexity, linear complexity does not always mean it is faster or more efficient than FlashAttention. The authors do not provide a critical analysis of efficiency on real-world hardware. No latency measurements in seconds were reported. Latency analysis is especially crucial for the CNN, since the small size of Conv operation is known to be slower than nor

Reviewer 04·Rating 2·Confidence 3

Strengths

- Sparse attention is an important topic for long-context models, as the attention operation is quadratic.
- The authors utilize many recent works as motivation and backbone for their method.

Weaknesses

- L165: The probing step remains quadratic due to the constant setting of the block size $M$. As this is part of the attention computation as a precursor operation, I don't think you can claim that the attention computation is a constant $MK$.
- How can the method presented in Figure 2 ever learn to retain a token in a simple task such as needle-in-a-haystack? For example, datasets like RULER have some tasks where the retrieval target can be any number of key-value pairs, which is not known until t
