
TL;DR
The paper introduces Kaczmarz Linear Attention (KLA), a linear attention mechanism inspired by the Kaczmarz projection method that improves both the efficiency and the accuracy of language modeling at scale.
Contribution
The paper revisits the online-regression objective underlying Gated DeltaNet (GDN), derives a key-norm-normalized dynamic step size for residual updates, and proposes KLA, a simplified linear attention model that outperforms previous linear-time baselines.
Findings
KLA achieves the lowest validation perplexity among linear-time baselines at 0.4B scale.
KLA reaches 100% accuracy on single-needle-in-a-haystack retrieval tasks.
KLA improves multi-query associative recall by 7.03 points and doubles decode throughput at 32K context.
Abstract
Long-context language modeling remains central to modern sequence modeling, but the quadratic cost of Transformer attention makes scaling computationally prohibitive. Linear recurrent models address this bottleneck by compressing the context into a fixed-size state, making the rule that forgets, writes, and edits information a central design problem. To address state maintenance, Gated DeltaNet (GDN) combines gated state decay with delta-rule residual writes, using a learnable coefficient to balance forgetting and update magnitude. However, this coefficient is learned empirically rather than derived from the underlying objective, which can lead to suboptimal update magnitudes. We revisit the online-regression objective underlying GDN and, inspired by the Kaczmarz projection method, derive the key-norm-normalized dynamic step size for…
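The update rule the abstract describes can be illustrated with a small sketch. The abstract is truncated before it states the paper's exact formula, so the code below is not KLA itself: it assumes the classical Kaczmarz step size 1 / ||k||^2 applied to a gated delta-rule write, and the function names (`matvec`, `kaczmarz_delta_step`) and the scalar `decay` gate are hypothetical choices made for illustration.

```python
def matvec(S, k):
    """Multiply the state matrix S (rows = value dims) by the key vector k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def kaczmarz_delta_step(S, k, v, decay=1.0, eps=1e-12):
    """One gated residual write with a key-norm-normalized step size.

    Assumption: this is the textbook Kaczmarz projection applied to the
    delta rule, not necessarily the exact update used in the paper.
    """
    # Gated state decay: uniformly shrink the old memory (forgetting).
    S = [[decay * s for s in row] for row in S]

    # Residual between the target value and what the state currently recalls.
    pred = matvec(S, k)
    residual = [v_i - p_i for v_i, p_i in zip(v, pred)]

    # Kaczmarz step size: inverse squared key norm (eps avoids division by zero).
    step = 1.0 / (sum(x * x for x in k) + eps)

    # Rank-one write: S <- S + step * residual * k^T.
    return [
        [s + step * r * k_j for s, k_j in zip(row, k)]
        for row, r in zip(S, residual)
    ]


# Usage: after one write, querying with the same key returns the stored value.
S0 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
key = [1.0, 2.0, 2.0]
value = [3.0, -1.0]
S1 = kaczmarz_delta_step(S0, key, value, decay=0.5)
recalled = matvec(S1, key)
```

With this step size, the write projects the state onto the set of matrices satisfying S k = v, so `recalled` matches `value` up to floating-point error regardless of the key's norm. A learned coefficient, as in GDN, would instead under- or over-shoot that target unless it happened to equal 1 / ||k||^2.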
