Kaczmarz Linear Attention

Jiaxuan Zou; Ruifeng Ren; Yong Liu

arXiv:2605.08587·cs.LG·May 12, 2026

TL;DR

The paper introduces Kaczmarz Linear Attention (KLA), a novel linear attention mechanism inspired by the Kaczmarz projection method, which improves language modeling efficiency and accuracy at scale.

Contribution

It derives a key-norm-normalized dynamic step size for delta-rule residual updates and proposes KLA, a simplified linear attention model that outperforms prior linear-time baselines at the 0.4B scale.

Findings

01  KLA achieves the lowest validation perplexity among linear-time baselines at 0.4B scale.

02  KLA reaches 100% accuracy on single-needle-in-a-haystack retrieval tasks.

03  KLA improves multi-query associative recall by 7.03 points and doubles decode throughput at 32K context.

Abstract

Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size $\beta_t = \eta_t / (\|k_t\|_2^2 + \epsilon)$ for…
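
The step size in the abstract can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the excerpt does not show KLA's full recurrence, so the function names (`kaczmarz_step_size`, `gated_delta_update`), the scalar decay `alpha`, and the state layout (a matrix `S` mapping keys to values) are hypothetical choices based on the generic gated delta-rule form (decay the state, then write the residual `v - S k` along `k`).

```python
import numpy as np


def kaczmarz_step_size(k, eta=1.0, eps=1e-6):
    """beta_t = eta_t / (||k_t||_2^2 + eps), the step size given in the abstract."""
    return eta / (np.dot(k, k) + eps)


def gated_delta_update(S, k, v, alpha=1.0, eta=1.0, eps=1e-6):
    """One recurrent step of a generic gated delta rule (hypothetical form).

    S: state of shape (d_v, d_k); k: key of shape (d_k,); v: value of shape (d_v,).
    alpha is an assumed scalar decay gate; the paper's exact gating is not
    shown in the excerpt.
    """
    beta = kaczmarz_step_size(k, eta, eps)
    S = alpha * S                      # gated decay: forget part of the old state
    residual = v - S @ k               # error of the current read-out for key k
    return S + beta * np.outer(residual, k)  # rank-1 residual write along k


rng = np.random.default_rng(0)
S0 = rng.normal(size=(4, 8))
k = rng.normal(size=8)
v = rng.normal(size=4)

# eta = 1, eps = 0, alpha = 1 is the classic Kaczmarz projection:
# after the write, reading key k returns v exactly (up to rounding).
S1 = gated_delta_update(S0, k, v, alpha=1.0, eta=1.0, eps=0.0)
print(np.allclose(S1 @ k, v))
```

With `eta = 1`, `eps = 0`, and no decay, the write is an exact projection onto the constraint `S k = v`, regardless of the key's norm. A smaller `eta` shrinks the residual by the factor `1 - eta` instead, which is the sense in which normalizing by the key norm makes the update magnitude independent of the scale of `k`.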
