OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention

Chenyu Zhou; Hongpei Li; Yuerou Liu; Jianghao Lin; Dongdong Ge; Yinyu Ye

arXiv:2605.13473 · cs.LG · May 14, 2026

TL;DR

This paper introduces OSDN (Online Scaled DeltaNet), an online diagonal preconditioning method for delta-rule linear attention. Its preconditioner adapts online via hypergradient feedback, it improves in-context recall, and it scales to billion-parameter models with proven convergence.

Contribution

The paper proposes OSDN, an online preconditioning technique with theoretical guarantees that improves the performance and scalability of delta-rule models in linear attention architectures.

Findings

01. OSDN improves in-context recall by 32% at 340M parameters.

02. At 1.3B parameters, OSDN reduces the recall residual ratio by 39%.

03. OSDN maintains performance on downstream evaluations such as perplexity and LongBench.

Abstract

Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish…
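
The equivalence between right-preconditioning and write-side key scaling described in the abstract can be checked numerically. The sketch below is illustrative only and is not the authors' implementation. It assumes a state matrix S of shape d_v × d_k, a key k, a value v, a scalar gate beta, and a fixed diagonal preconditioner stored as a vector d. All of these names are hypothetical. The hypergradient update of d is not shown because the abstract does not give it.

```python
import random

def residual(S, k, v):
    """Prediction error S k - v of the inner loss 0.5 * ||S k - v||^2."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) - v_i
            for row, v_i in zip(S, v)]

def delta_step(S, k, v, beta):
    """Plain delta rule: S <- S - beta * (S k - v) k^T (one online GD step)."""
    err = residual(S, k, v)
    return [[s_ij - beta * e_i * k_j for s_ij, k_j in zip(row, k)]
            for row, e_i in zip(S, err)]

def preconditioned_step(S, k, v, beta, d):
    """Right-preconditioned step: S <- S - beta * (S k - v) k^T diag(d)."""
    err = residual(S, k, v)
    grad = [[e_i * k_j for k_j in k] for e_i in err]  # gradient (S k - v) k^T
    # Right-multiplying by diag(d) scales column j of the gradient by d[j].
    return [[s_ij - beta * g_ij * d_j for s_ij, g_ij, d_j in zip(row, g_row, d)]
            for row, g_row in zip(S, grad)]

def scaled_key_step(S, k, v, beta, d):
    """Same update written as a delta step with a scaled write-side key d * k."""
    err = residual(S, k, v)  # the error still uses the unscaled key
    k_write = [d_j * k_j for d_j, k_j in zip(d, k)]
    return [[s_ij - beta * e_i * kw_j for s_ij, kw_j in zip(row, k_write)]
            for row, e_i in zip(S, err)]

def max_abs_diff(A, B):
    """Largest entrywise gap between two matrices given as nested lists."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Small random instance (sizes and values are arbitrary, for illustration).
rng = random.Random(0)
d_v, d_k = 3, 4
S = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(d_v)]
k = [rng.gauss(0, 1) for _ in range(d_k)]
v = [rng.gauss(0, 1) for _ in range(d_v)]
d = [rng.uniform(0.5, 2.0) for _ in range(d_k)]
beta = 0.5

gap = max_abs_diff(preconditioned_step(S, k, v, beta, d),
                   scaled_key_step(S, k, v, beta, d))
print("max gap between the two forms:", gap)
```

The two forms agree up to floating-point rounding because (S k - v) k^T diag(d) = (S k - v) (diag(d) k)^T. Only the key used for writing changes, so the update keeps the rank-one shape of a standard delta-rule write.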
