FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

Pingwei Sun; Yuxuan Hu; Jianchao Tan; Xue Wang; Jiaqi Zhang; Yifan Lu; Yerui Sun; Yuchen Xie; Xunliang Cai

arXiv:2604.19021·cs.LG·May 5, 2026


TL;DR

FG$^2$-GDN introduces channel-wise adaptive learning rates and decoupled scaling for keys and values, significantly enhancing long-context associative recall in linear attention models.

Contribution

The paper proposes FG$^2$-GDN and FG$^2$-GDN+ with dimension-specific control and decoupled scaling, advancing the capabilities of Gated Delta Networks.

Findings

1. Improved associative recall on synthetic benchmarks.
2. Enhanced long-context understanding on real-world tasks.
3. Maintained computational efficiency comparable to prior models.

Abstract

Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector, analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write…
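To make the scalar-versus-channel-wise distinction concrete, here is a minimal NumPy sketch of a single delta-rule step viewed as one gradient step on the online regression loss $\tfrac{1}{2}\lVert S^\top k - v\rVert^2$. It is an illustration under stated assumptions, not the paper's implementation: the function name `delta_step`, the state layout `(d_k, d_v)`, and the choice to apply the vector learning rate along the key channels are all hypothetical, and the exact FG$^2$-GDN parameterization may differ.

```python
import numpy as np

def delta_step(S, k, v, beta, alpha=1.0):
    """One gated delta-rule update on a state S of shape (d_k, d_v).

    beta:  scalar step size, or a length-d_k vector (ASSUMED channel-wise form).
    alpha: scalar decay, or a length-d_k vector of channel-wise decay.
    Hypothetical helper for illustration only.
    """
    beta = np.broadcast_to(np.asarray(beta, dtype=S.dtype), k.shape)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=S.dtype), k.shape)
    S = alpha[:, None] * S      # decay each key channel of the state
    pred = k @ S                # value currently retrieved for key k, shape (d_v,)
    # Gradient of the loss w.r.t. S is outer(k, pred - v); scaling row i by
    # beta[i] gives a per-coordinate step, as in per-coordinate optimizers.
    return S + np.outer(beta * k, v - pred)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)          # unit-norm key
v = rng.normal(size=d_v)

# Scalar learning rate: with beta = 1 and a unit-norm key, the old
# association is fully overwritten, so reading with k returns v.
S_scalar = delta_step(S0, k, v, beta=1.0)

# Channel-wise learning rate: channel 0 is frozen, the others update fully.
S_vector = delta_step(S0, k, v, beta=np.array([0.0, 1.0, 1.0, 1.0]))
```

In this sketch a scalar `beta` moves every key channel by the same relative amount, while a vector `beta` lets individual channels be updated more or less aggressively; setting an entry to zero leaves that row of the state untouched.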
