TL;DR
Gated DeltaNet-2 is a linear attention model that decouples erasing from writing when it edits its recurrent state, which improves performance on long-context language modeling and reasoning benchmarks.
Contribution
The paper presents Gated DeltaNet-2, a linear attention model with separate gates for erasing and writing, which makes memory edits more flexible and improves performance over prior models such as Gated DeltaNet and KDA.
Findings
Achieves state-of-the-art results on long-context benchmarks.
Improves multi-key retrieval accuracy.
Maintains efficient parallel training with a new backward pass.
Abstract
Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a…
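The tie between erasing and writing described in the abstract can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's formulation: the abstract is cut off before it defines Gated Delta Rule-2, so the `erase` and `write` scalars and the function names below are hypothetical stand-ins, and decay (adaptive forgetting, channel-wise or otherwise) is left out entirely. The state `S` is a small value-by-key matrix, keys are assumed to be unit-norm, and the tied case uses the standard delta-rule form where one scalar `beta` scales both the erase and the write.

```python
# Illustrative sketch only: `erase` and `write` are hypothetical scalar gates,
# not the gate parameterization used by Gated DeltaNet-2.

def read(S, k):
    """Read the value currently stored under key k, i.e. the product S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_update(S, k, v, erase, write):
    """One delta-style edit: S <- S - erase * (S k) k^T + write * v k^T."""
    old = read(S, k)  # the "current read" that gets subtracted before writing
    return [
        [S[i][j] - erase * old[i] * k[j] + write * v[i] * k[j]
         for j in range(len(k))]
        for i in range(len(v))
    ]


def tied_update(S, k, v, beta):
    """Scalar-tied edit: the same beta controls both erasing and writing."""
    return delta_update(S, k, v, erase=beta, write=beta)


# Store the value [3, 4] under a unit-norm key with a full-strength edit.
k = [1.0, 0.0]
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1 = tied_update(S0, k, [3.0, 4.0], beta=1.0)

# Edit the same key with a new value [1, 1].
v_new = [1.0, 1.0]
S_tied = tied_update(S1, k, v_new, beta=0.5)                 # half erase, half write
S_split = delta_update(S1, k, v_new, erase=1.0, write=0.5)   # full erase, half write

print(read(S1, k))       # [3.0, 4.0]
print(read(S_tied, k))   # [2.0, 2.5]
print(read(S_split, k))  # [0.5, 0.5]
```

With the tied gate, a cautious write at `beta=0.5` also leaves half of the old value in place, so the read is a blend of old and new content. Separating the two strengths lets the old association be cleared completely while the new value is committed at half strength, which a single scalar cannot express.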
