Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
Vishal Pandey, Gopal Singh

TL;DR
This paper introduces Variational Linear Attention (VLA), a novel method that stabilizes memory in long-context transformers, enabling efficient, accurate associative recall with reduced interference and improved speed.
Contribution
VLA reframes memory updates as an online regularised least-squares problem, providing theoretical stability guarantees and practical improvements over existing linear attention methods.
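For reference, the generic online regularised least-squares (ridge regression) memory fits a linear map S from keys to values over all tokens seen so far. The form below is an assumed simplification: it uses a fixed scalar penalty λ, whereas the paper describes an adaptive penalty matrix, and the symbols S_t, k_i, v_i, λ are illustrative rather than taken from the paper.

```latex
S_t = \arg\min_{S \in \mathbb{R}^{d_v \times d_k}}
      \sum_{i=1}^{t} \lVert S k_i - v_i \rVert_2^2 + \lambda \lVert S \rVert_F^2
\quad\Longrightarrow\quad
S_t = \Big(\sum_{i=1}^{t} v_i k_i^\top\Big)\Big(\lambda I + \sum_{i=1}^{t} k_i k_i^\top\Big)^{-1}
```

The first factor is exactly the state that standard linear attention accumulates. The second factor rescales it by the inverse of the regularised key covariance, which is what a least-squares formulation adds on top of plain additive writes.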
Findings
VLA reduces the Frobenius norm of the memory state by 109× relative to standard linear attention at T = 1000.
Achieves near-perfect exact-match accuracy on multi-query associative recall within memory limits.
Provides a 14× speedup over a Python implementation, with its latency crossing that of softmax attention at 43,000 tokens.
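The state growth that VLA is designed to suppress can be seen in a toy setting. The sketch below assumes NumPy and uses a hypothetical helper, `linear_attention_state_norms`, that accumulates the standard linear-attention state S_t = Σ v_i k_iᵀ from random unit-norm keys and values and records its Frobenius norm at every step. It illustrates the baseline only and does not reproduce the paper's 109× measurement.

```python
import numpy as np


def linear_attention_state_norms(T, d=64, seed=0):
    """Frobenius norm of S_t = sum_{i<=t} v_i k_i^T for t = 1..T.

    Illustrative assumption: keys and values are independent random
    unit vectors in d dimensions (not real model activations).
    """
    rng = np.random.default_rng(seed)
    keys = rng.standard_normal((T, d))
    values = rng.standard_normal((T, d))
    # Normalise every key and value to unit length.
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    values /= np.linalg.norm(values, axis=1, keepdims=True)

    S = np.zeros((d, d))
    norms = []
    for k, v in zip(keys, values):
        S += np.outer(v, k)  # additive write: nothing is ever erased or decayed
        norms.append(np.linalg.norm(S))  # 2-D default is the Frobenius norm
    return np.array(norms)


if __name__ == "__main__":
    norms = linear_attention_state_norms(1000)
    for t in (10, 100, 1000):
        print(t, norms[t - 1], np.sqrt(t))
```

With independent random unit vectors the cross terms (v_i·v_j)(k_i·k_j) average to zero, so E‖S_T‖_F² = T and the norm grows roughly like √T in this setting; strongly correlated keys (for example, repeated keys) make it grow faster, up to linearly in T.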
Abstract
Linear attention reduces the quadratic cost of softmax attention to linear in sequence length, but its memory state keeps growing in Frobenius norm as the sequence lengthens, causing progressive interference between stored associations. We introduce Variational Linear Attention (VLA), which reframes the memory update as an online regularised least-squares problem with an adaptive penalty matrix maintained via the Sherman-Morrison rank-1 formula. We prove that normalising the write direction to unit length fixes the spectral norm of the recurrence Jacobian at an exact value for all sequence lengths and head dimensions (Proposition 2), and that the state norm is self-limiting under bounded inputs (Proposition 1). Empirically, VLA reduces the Frobenius norm of the memory state by 109× relative to standard linear attention at T = 1000, achieves near-perfect exact-match accuracy on multi-query associative recall within the effective…
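The Sherman-Morrison step can be illustrated with standard recursive least squares (RLS), which keeps P_t = (λI + Σ k_i k_iᵀ)⁻¹ up to date with one rank-1 correction per token instead of re-inverting a d_k × d_k matrix. This is a minimal sketch under stated assumptions: NumPy, a fixed scalar ridge penalty `lam`, and unnormalised keys. The function name `rls_memory` is hypothetical, and VLA's adaptive penalty matrix and unit-length write direction are not modelled.

```python
import numpy as np


def rls_memory(keys, values, lam=1.0):
    """Online ridge-regression memory via Sherman-Morrison (standard RLS).

    keys:   array of shape (T, d_k)
    values: array of shape (T, d_v)
    Returns (S, P) where S (d_v, d_k) minimises
        sum_i ||S k_i - v_i||^2 + lam * ||S||_F^2
    and P = (lam * I + sum_i k_i k_i^T)^{-1}.
    """
    d_k = keys.shape[1]
    d_v = values.shape[1]
    P = np.eye(d_k) / lam  # P_0 = (lam * I)^{-1}
    S = np.zeros((d_v, d_k))  # ridge solution before any data is seen
    for k, v in zip(keys, values):
        Pk = P @ k
        g = Pk / (1.0 + k @ Pk)  # gain vector; equals P_t k after the update
        P = P - np.outer(g, Pk)  # Sherman-Morrison rank-1 update of the inverse
        S = S + np.outer(v - S @ k, g)  # correct the prediction error along g
    return S, P


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = rng.standard_normal((200, 16))
    V = rng.standard_normal((200, 8))
    S, P = rls_memory(K, V, lam=0.5)
    # Compare against the batch closed-form ridge solution.
    A = 0.5 * np.eye(16) + K.T @ K
    S_batch = V.T @ K @ np.linalg.inv(A)
    print("max |S - S_batch| =", np.abs(S - S_batch).max())
```

Each step costs O(d_k² + d_v·d_k) regardless of how many tokens have been written, which is why a rank-1 inverse update keeps the recurrence linear in sequence length.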
