LoLA: Low-Rank Linear Attention With Sparse Caching
Luke McDermott, Robert W. Heath Jr., Rahul Parhi

TL;DR
LoLA enhances linear attention in transformers by integrating a multi-system memory augmentation, significantly improving long-term recall and performance on various tasks without increasing training complexity.
Contribution
LoLA introduces a training-free memory augmentation for linear attention, boosting associative recall and efficiency in long-context scenarios.
Findings
Achieves 97.4% accuracy on pass-key retrieval tasks.
Uses a 4.6x smaller cache than Llama-3.1 8B.
Outperforms other models on zero-shot reasoning.
Abstract
The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the…
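The three memory systems described in the abstract can be sketched as a toy, single-head NumPy class. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `ThreeTierMemory`, the feature map `phi` (ELU + 1), the exact form of the self-recall error, scoring a pair as if it were already folded into the state, and the way cached pairs and the linear state are combined in `read` are all hypothetical choices made for this sketch.

```python
import numpy as np

def phi(x):
    # Hypothetical positive feature map (ELU + 1); an assumption for this sketch.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

class ThreeTierMemory:
    """Toy single-head memory: sliding window + sparse cache + linear state."""

    def __init__(self, d_k, d_v, window, sparse_size):
        self.H = np.zeros((d_k, d_v))   # linear state: sum of phi(k) v^T
        self.s = np.zeros(d_k)          # normalizer: sum of phi(k)
        self.window = window            # sliding-window size
        self.sparse_size = sparse_size  # sparse-cache size
        self.local = []                 # (i) recent pairs, oldest first
        self.sparse = []                # (ii) difficult-to-memorize pairs

    def self_recall_error(self, k, v):
        # Assumed form: query the state with the pair's own key and measure
        # how far the recalled value is from the stored value. The pair is
        # scored as if it had already been folded into the state.
        f = phi(k)
        H = self.H + np.outer(f, v)
        s = self.s + f
        v_hat = (f @ H) / (f @ s)
        return float(np.sum((v - v_hat) ** 2))

    def write(self, k, v):
        self.local.append((k, v))
        if len(self.local) <= self.window:
            return
        # The oldest pair leaves the window and competes for a sparse slot.
        candidates = self.sparse + [self.local.pop(0)]
        if len(candidates) > self.sparse_size:
            # Fold the easiest-to-recall pair into the linear state (iii).
            errors = [self.self_recall_error(ck, cv) for ck, cv in candidates]
            ek, ev = candidates.pop(int(np.argmin(errors)))
            self.H += np.outer(phi(ek), ev)
            self.s += phi(ek)
        self.sparse = candidates

    def read(self, q):
        # Assumed combination: exponential weights for cached pairs and a
        # linear-attention term for everything absorbed into (H, s).
        # Requires at least one written pair, otherwise the denominator is 0.
        num = phi(q) @ self.H
        den = phi(q) @ self.s
        for k, v in self.sparse + self.local:
            w = np.exp(q @ k)
            num = num + w * v
            den = den + w
        return num / den
```

In this sketch, every pair that is evicted from the window either stays in the sparse cache at full rank or is absorbed into the constant-size state, so the total footprint is bounded by `window + sparse_size` pairs plus `H` and `s`, regardless of context length.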
Peer Reviews
Decision·Submitted to ICLR 2026
1. The paper empirically demonstrates an improvement on tasks where the base model completely fails. 2. The proposal of a new sparse global cache is novel.
1. The method adds a new computational cost not present in standard linear attention, specifically for scoring and managing the sparse cache. The paper gives this overhead as O(λd), and λ itself needs to be larger for more complex, long-context tasks. 2. LoLA moves away from the simplicity of linear attention by requiring three distinct memory systems that must be managed. 3. This caching strategy cannot fully compensate for the knowledge lost during the base model's efficient distillation from a
1. The paper's core idea is highly original. Instead of using query similarity or softmax scores for sparse attention, it introduces the Self-Recall Error (SRE). This query-agnostic metric provides a principled, data-driven way to determine which tokens are "difficult to memorize" for the linear state and should be cached in full rank. This is a clever and novel approach to mitigating memory collisions. 2. The claims are supported by exceptionally strong and well-targeted experiments. The meth
1. The paper states that the scoring introduces a "small overhead compute cost". However, the proposed algorithm re-scores all $\lambda$ elements in the sparse cache plus the new candidate token(s) at every generation step. This overhead is nontrivial, especially when $\lambda$ is large. The Time to First Token (TTFT) in Figure 4 confirms this. For a 64-token window, TTFT increases from 0.99s ($\lambda=0$) to 1.46s ($\lambda=512$). This is a roughly 47% slowdown. This trade-off is not sufficiently a
The SRE criterion is intuitive and easy to compute given $\Phi(k)$, $H$, and $s$; the paper supplies pseudo-code and a useful efficiency study (TTFT and VRAM) versus sliding-window size $\eta$ and sparse-cache size $\lambda$, which practitioners can adopt to tune deployments.
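As a hedged reading of the criterion referred to above, assuming the standard normalized linear-attention readout (the paper's exact definition is not reproduced on this page, so the form below is an assumption), the score for a stored pair $(k, v)$ could be written as:

$$\hat{v} = \frac{H^\top \Phi(k)}{s^\top \Phi(k)}, \qquad \mathrm{SRE}(k, v) = \lVert v - \hat{v} \rVert_2^2$$

Here $H = \sum_i \Phi(k_i)\, v_i^\top$ is taken to be the recurrent hidden state and $s = \sum_i \Phi(k_i)$ its normalizer. Under this reading, a pair whose value is recalled poorly by its own key has a large SRE and is a candidate for the sparse cache, and each evaluation costs one matrix-vector product with $H$.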
1. **Positioning / novelty is narrow and tied to a special base model.** Although billed as “training-free,” LoLA **assumes** a specific *subquadratic* base (sliding-window + linear attention) obtained via distillation/LoRA (40M tokens) before LoLA can be used. It is therefore not a drop-in for standard Transformers and the headline “training-free” risks misinterpretation. Please clarify scope and re-title accordingly; also separate the cost/benefit of distillation from LoLA’s cache policy. 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Attention Is All You Need · Softmax · Balanced Selection
