OSDN: Improving Delta Rule with Provable Online Preconditioning in Linear Attention
Chenyu Zhou, Hongpei Li, Yuerou Liu, Jianghao Lin, Dongdong Ge, Yinyu Ye

TL;DR
This paper introduces OSDN (Online Scaled DeltaNet), an online diagonal preconditioner for delta-rule linear attention models. It improves in-context recall, scales to billion-parameter models, adapts per-feature step sizes online, and comes with proven convergence.
Contribution
The paper proposes OSDN, an online preconditioning technique with theoretical guarantees that improves the in-context recall of delta-rule models in linear attention architectures while preserving DeltaNet's hardware-friendly chunkwise parallel pipeline.
Findings
OSDN improves in-context recall by 32% at 340M parameters.
At 1.3B parameters, OSDN reduces recall residual ratio by 39%.
OSDN maintains language-modeling perplexity and performance on downstream benchmarks such as LongBench.
Abstract
Linear attention and state-space models offer constant-memory alternatives to softmax attention, but often struggle with in-context associative recall. The Delta Rule mitigates this by writing each token via one step of online gradient descent. However, its step size relies on a single scalar gate that ignores the feature-wise curvature of the inner objective. We propose Online Scaled DeltaNet (OSDN), which augments the scalar gate with a diagonal preconditioner updated online via hypergradient feedback. Crucially, this right-preconditioning is algebraically equivalent to a per-feature scaling of the write-side key. This equivalence allows OSDN to strictly preserve the hardware-friendly chunkwise parallel pipeline of DeltaNet without incurring high-dimensional state overhead. Theoretically, by exploiting the exact-quadratic structure of the inner regression loss, we establish…
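The equivalence stated in the abstract can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation. It assumes the inner objective is the per-token regression loss 0.5 * ||S k - v||^2, so the delta-rule write is S <- S - beta * (S k - v) k^T. The function names, the fixed preconditioner `d`, and the toy values are all hypothetical, and the hypergradient update of the preconditioner is deliberately left out because the summary does not specify it.

```python
def prediction_error(S, k, v):
    """Residual S k - v of the current state on one key/value pair."""
    return [sum(S[i][j] * k[j] for j in range(len(k))) - v[i]
            for i in range(len(v))]


def delta_step(S, k, v, beta, write_key=None):
    """One delta-rule write: S <- S - beta * (S k - v) w^T.

    The read-side key k is used for the residual; w defaults to k,
    but a different write-side key can be supplied.
    """
    w = k if write_key is None else write_key
    err = prediction_error(S, k, v)
    return [[S[i][j] - beta * err[i] * w[j] for j in range(len(k))]
            for i in range(len(v))]


def right_preconditioned_step(S, k, v, beta, d):
    """Same write, with the gradient right-multiplied by diag(d)."""
    err = prediction_error(S, k, v)
    grad = [[err[i] * k[j] for j in range(len(k))] for i in range(len(v))]
    return [[S[i][j] - beta * grad[i][j] * d[j] for j in range(len(k))]
            for i in range(len(v))]


def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


# Toy values (assumed, for illustration only).
S = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]   # state, shape (d_v, d_k)
k = [1.0, 2.0, -1.0]                         # key
v = [0.5, 1.5]                               # value
beta = 0.1                                   # scalar gate
d = [1.0, 0.5, 2.0]                          # diagonal preconditioner (fixed here)

S_precond = right_preconditioned_step(S, k, v, beta, d)
S_scaled = delta_step(S, k, v, beta,
                      write_key=[dj * kj for dj, kj in zip(d, k)])
print(max_abs_diff(S_precond, S_scaled))
```

Because diag(d) is diagonal, (S k - v) k^T diag(d) equals (S k - v) (d ⊙ k)^T, so the two updates agree up to floating-point rounding. Only the write-side key changes, which is why the state shape and the rest of the update stay the same as in the plain delta rule.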
