FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
Pingwei Sun, Yuxuan Hu, Jianchao Tan, Xue Wang, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

TL;DR
FG$^2$-GDN introduces channel-wise adaptive learning rates and decoupled scaling for keys and values, significantly enhancing long-context associative recall in linear attention models.
Contribution
The paper proposes FG$^2$-GDN and FG$^2$-GDN+ with dimension-specific control and decoupled scaling, advancing the capabilities of Gated Delta Networks.
Findings
Improved associative recall on synthetic benchmarks.
Enhanced long-context understanding on real-world tasks.
Maintained computational efficiency comparable to prior models.
Abstract
Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate in the delta update remains a scalar, limiting the model's capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar with a channel-wise vector, analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write…
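The difference between a scalar and a channel-wise learning rate in the delta rule can be illustrated with a minimal sketch. It is not the paper's implementation: it treats the delta rule as one gradient step on the loss $\frac{1}{2}\lVert S k - v \rVert^2$, omits the decay gate entirely, and assumes the channel-wise rate acts on the key dimension. The function name `delta_step` and the argument `beta` are hypothetical, chosen for illustration only.

```python
def delta_step(S, k, v, beta):
    """One delta-rule update of the state S (d_v rows, d_k columns).

    Assumed form (illustrative, decay gate omitted):
        S_new = S - (S k - v) (beta * k)^T
    where beta is either a scalar (one learning rate for all channels)
    or a list of length d_k (one learning rate per key channel).
    """
    d_k = len(k)
    # Broadcast a scalar learning rate to every key channel.
    b = [beta] * d_k if isinstance(beta, (int, float)) else list(beta)
    # Prediction error of the current memory for this key: S k - v.
    err = [sum(row[j] * k[j] for j in range(d_k)) - v_i
           for row, v_i in zip(S, v)]
    # Gradient step, scaled per key channel by b[j].
    return [[row[j] - e * b[j] * k[j] for j in range(d_k)]
            for row, e in zip(S, err)]


S0 = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]
v = [3.0, 4.0]

S_scalar = delta_step(S0, k, v, 1.0)          # scalar learning rate
S_vector = delta_step(S0, k, v, [0.5, 1.0])   # channel-wise learning rate
print(S_scalar)
print(S_vector)
```

With a scalar rate of 1.0 the value is written in full along the active key channel. With the vector rate, the first key channel is updated at half strength, which is the kind of per-dimension control the abstract compares to per-coordinate optimizers.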
