Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

Vishal Pandey; Gopal Singh

arXiv:2605.11196 · cs.LG · May 13, 2026

TL;DR

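This paper introduces Variational Linear Attention (VLA), which treats the linear-attention memory update as an online regularised least-squares problem so that the memory state stays bounded in long-context transformers. The result is associative recall with less interference between stored associations, at linear cost in sequence length.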

Contribution

VLA reframes memory updates as an online regularised least-squares problem, providing theoretical stability guarantees and practical improvements over existing linear attention methods.
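
To make the least-squares framing concrete, the following is a minimal sketch of a generic recursive ridge-regression memory, not the authors' exact update rule. It stores key-value pairs in a matrix `S` and maintains the inverse penalty matrix `P` with the Sherman-Morrison rank-1 formula. The function name `rls_write`, the regulariser `lam`, and the toy dimensions are all assumptions introduced for illustration.

```python
import numpy as np

def rls_write(S, P, k, v):
    """One online ridge-regression write of the pair (k, v).

    S: (d_v, d_k) memory matrix, read out as S @ k.
    P: (d_k, d_k) inverse of (lam * I + sum of k k^T seen so far).
    Hypothetical helper: a textbook recursive least-squares step.
    """
    Pk = P @ k
    denom = 1.0 + k @ Pk
    # Sherman-Morrison: (A + k k^T)^-1 = A^-1 - (A^-1 k k^T A^-1) / (1 + k^T A^-1 k)
    P_new = P - np.outer(Pk, Pk) / denom
    # Write only the residual v - S k, along the preconditioned direction P k.
    S_new = S + np.outer(v - S @ k, Pk) / denom
    return S_new, P_new

# Toy setup (assumed sizes): fewer pairs than key dimensions.
rng = np.random.default_rng(0)
d_k, d_v, n, lam = 16, 8, 10, 1e-3
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = np.zeros((d_v, d_k))
P = np.eye(d_k) / lam  # inverse of the initial penalty lam * I
for k, v in zip(K, V):
    S, P = rls_write(S, P, k, v)

# Closed-form ridge solution over all pairs, for comparison.
S_batch = (V.T @ K) @ np.linalg.inv(lam * np.eye(d_k) + K.T @ K)
max_recall_error = np.abs(K @ S.T - V).max()
```

In this sketch the online updates reproduce the batch ridge solution, and each write costs O(d_k^2 + d_k d_v) rather than a fresh matrix inversion. How VLA adapts the penalty matrix and normalises the write direction is specified in the paper itself, not here.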

Findings

01. VLA reduces the Frobenius norm of the memory state by 109× relative to standard linear attention at T = 1,000.

02. VLA achieves near-perfect exact-match accuracy on multi-query associative recall within memory limits.

03. Provides a 14× speedup over the Python implementation, crossing softmax attention latency at 43,000 tokens.

Abstract

Linear attention reduces the quadratic cost of softmax attention to $O(T)$, but its memory state grows as $O(T)$ in Frobenius norm, causing progressive interference between stored associations. We introduce Variational Linear Attention (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length gives the recurrence Jacobian spectral norm exactly $1$ for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces $\|S_t\|_F$ by $109\times$ relative to standard linear attention at $T = 1{,}000$, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective…
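
The norm growth of the baseline can be illustrated with a short sketch. It assumes the standard additive linear-attention state $S_t = \sum_{i \le t} v_i k_i^\top$ with unit-norm keys and values, for which the triangle inequality gives $\|S_t\|_F \le t$; the bound is attained when the same pair is written repeatedly. The helper `additive_state_norms` and the sizes `T` and `d` are illustrative assumptions, and the sketch covers only the baseline, not VLA.

```python
import numpy as np

def additive_state_norms(K, V):
    """Frobenius norm of S_t = sum_{i<=t} v_i k_i^T after each write."""
    S = np.zeros((V.shape[1], K.shape[1]))
    norms = []
    for k, v in zip(K, V):
        S += np.outer(v, k)               # plain additive write, no correction
        norms.append(np.linalg.norm(S))   # default matrix norm is Frobenius
    return np.array(norms)

def unit_rows(X):
    """Scale every row of X to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

T, d = 1000, 32  # assumed sequence length and head dimension
rng = np.random.default_rng(0)

# Case 1: independent random unit-norm keys and values.
K_rand = unit_rows(rng.standard_normal((T, d)))
V_rand = unit_rows(rng.standard_normal((T, d)))

# Case 2: the same unit-norm pair written T times (worst case for the bound).
K_same = np.tile(K_rand[0], (T, 1))
V_same = np.tile(V_rand[0], (T, 1))

norms_rand = additive_state_norms(K_rand, V_rand)
norms_same = additive_state_norms(K_same, V_same)
```

With repeated writes the norm grows linearly in the number of steps, while independent random pairs grow more slowly but still without any built-in limit. That unbounded accumulation is the behaviour a self-limiting state norm is meant to prevent.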
