LoLA: Low-Rank Linear Attention With Sparse Caching

Luke McDermott; Robert W. Heath Jr.; Rahul Parhi

arXiv:2505.23666·cs.CL·October 1, 2025

TL;DR

LoLA augments linear attention in transformers with three cooperating memory systems (a sliding-window cache, a sparse global cache, and the recurrent hidden state), improving long-context recall and task performance without any additional training.

Contribution

LoLA introduces a training-free memory augmentation for linear attention, boosting associative recall and efficiency in long-context scenarios.

Findings

01. Achieves 97.4% accuracy on pass-key retrieval tasks.

02. Uses a 4.6x smaller cache than Llama-3.1 8B.

03. Outperforms other models on zero-shot reasoning.

Abstract

The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the…
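The way the abstract distributes key-value pairs across three memory systems can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map (`elu(x) + 1`), the exact form of the self-recall error (distance between a stored value and the value the linear state returns when queried with the pair's own key), and the names `feature_map`, `self_recall_error`, and `ThreeTierMemory` are all hypothetical choices made for this example.

```python
import numpy as np


def feature_map(x):
    """Hypothetical positive feature map, elu(x) + 1 (an assumption)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def self_recall_error(k, v, H, s, eps=1e-6):
    """Assumed SRE: distance between v and the value the linear state
    returns when it is queried with the pair's own key."""
    phi = feature_map(k)
    recalled = (phi @ H) / (phi @ s + eps)
    return float(np.linalg.norm(recalled - v))


class ThreeTierMemory:
    def __init__(self, d_k, d_v, window, sparse_size):
        self.window = window            # sliding-window capacity
        self.sparse_size = sparse_size  # sparse global cache capacity (lambda)
        self.local = []                 # (i) recent pairs
        self.sparse = []                # (ii) difficult-to-memorize pairs
        self.H = np.zeros((d_k, d_v))   # (iii) recurrent hidden state
        self.s = np.zeros(d_k)          # normalizer of the linear state
        self.num_absorbed = 0

    def _absorb(self, k, v):
        # Fold one pair into the constant-size linear-attention state.
        phi = feature_map(k)
        self.H += np.outer(phi, v)
        self.s += phi
        self.num_absorbed += 1

    def append(self, k, v):
        self.local.append((k, v))
        if len(self.local) <= self.window:
            return
        # The oldest pair leaves the window and competes for a sparse slot.
        candidates = self.sparse + [self.local.pop(0)]
        if len(candidates) > self.sparse_size:
            # Re-score every candidate against the current state and fold
            # the easiest one (lowest error) into the linear state.
            errors = [self_recall_error(ck, cv, self.H, self.s)
                      for ck, cv in candidates]
            easiest = int(np.argmin(errors))
            self._absorb(*candidates.pop(easiest))
        self.sparse = candidates
```

In this sketch, each `append` call re-scores at most `sparse_size + 1` candidates, so the per-step scoring cost grows with the sparse-cache size while the linear state stays constant in size. Reading from the three tiers (combining cached pairs with the linear state at query time) is omitted here.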

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper empirically demonstrates an improvement on tasks where the base model completely fails.
2. The proposal of a new sparse global cache is novel.

Weaknesses

1. The method adds a computational cost not present in standard linear attention, specifically for scoring and managing the sparse cache. The paper gives this overhead as O(λd), and λ itself must grow for more complex, long-context tasks.
2. LoLA moves away from the simplicity of linear attention by requiring three distinct memory systems that must be managed.
3. This caching strategy cannot fully compensate for the knowledge lost during the base model's efficient distillation from a

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The paper's core idea is highly original. Instead of using query similarity or softmax scores for sparse attention, it introduces the Self Recall Error (SRE). This query-agnostic metric provides a principled, data-driven way to determine which tokens are "difficult to memorize" for the linear state and should be cached in full rank. This is a clever and novel approach to mitigating memory collisions.
2. The claims are supported by exceptionally strong and well-targeted experiments. The meth

Weaknesses

1. The paper states the scoring introduces a "small overhead compute cost". However, the proposed algorithm re-scores all $\lambda$ elements in the sparse cache plus the new candidate token(s) at every generation step. This overhead is nontrivial, especially when $\lambda$ is large. The Time to First Token (TTFT) in Figure 4 confirms this. For a 64-token window, TTFT increases from 0.99s ($\lambda=0$) to 1.46s ($\lambda=512$). This is a roughly 47% slowdown. This trade-off is not sufficiently a
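The slowdown figure quoted by the reviewer can be checked directly from the two TTFT values in the review. The helper name `relative_slowdown` is hypothetical; only the two timings come from the text above.

```python
def relative_slowdown(baseline_s, measured_s):
    """Fractional increase of measured_s over baseline_s."""
    return (measured_s - baseline_s) / baseline_s


ttft_no_sparse_cache = 0.99  # seconds, lambda = 0, 64-token window
ttft_sparse_512 = 1.46       # seconds, lambda = 512, 64-token window

slowdown = relative_slowdown(ttft_no_sparse_cache, ttft_sparse_512)
print(f"{slowdown:.1%}")  # prints 47.5%
```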

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The SRE criterion is intuitive and easy to compute given $\Phi(k)$, $H$, and $s$; the paper supplies pseudo-code and a useful efficiency study (TTFT and VRAM) versus sliding-window size $\eta$ and sparse-cache size $\lambda$, which practitioners can adopt to tune deployments.

Weaknesses

1. **Positioning / novelty is narrow and tied to a special base model.** Although billed as “training-free,” LoLA **assumes** a specific *subquadratic* base (sliding-window + linear attention) obtained via distillation/LoRA (40M tokens) before LoLA can be used. It is therefore not a drop-in for standard Transformers and the headline “training-free” risks misinterpretation. Please clarify scope and re-title accordingly; also separate the cost/benefit of distillation from LoLA’s cache policy. 2

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Attention Is All You Need · Softmax · Balanced Selection