Cosine-Gated Adam-Decay: Drop-In Staleness-Aware Outer Optimization for Decoupled DiLoCo
Vatsal Shah, Jiahao Sun

TL;DR
The paper introduces CGAD, an age-aware optimizer for asynchronous systems that scales pseudo-gradients based on their age, improving stability and convergence in large-scale language model training.
Contribution
CGAD is a novel, drop-in age-aware optimizer that models information decay and extends to partial-sync schedulers, with proven convergence bounds and empirical stability across large models.
Findings
CGAD trains stably across controlled delays in language model pretraining.
The cosine cutoff acts as scale insurance, maintaining low risk at large delays.
The standard Nesterov outer optimizer is less stable than CGAD in the experiments.
Abstract
Asynchronous DiLoCo systems may receive pseudo-gradients computed several outer rounds earlier, yet the standard Nesterov outer optimizer does not explicitly condition its update on per-update age. This can make the outer momentum buffer brittle under large controlled delays. We propose Cosine Gated Adam Decay (CGAD), a simple, drop-in, age-aware outer optimizer that scales each incoming pseudo-gradient by an age-dependent weight before it enters Adam's first- and second-moment buffers; the exponential factor of that weight models information decay, and the cosine gate smoothly zeroes contributions past a chosen cutoff. CGAD reduces to plain Adam when that weight equals 1, adds two hyperparameters whose defaults transfer across scales, and extends to partial-sync schedulers via a per-fragment age-aware variant (PA-CGAD). For an idealized gated-adaptive update on smooth nonconvex…
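The age-gating idea described in the abstract can be illustrated with a minimal sketch. The exact weight formula is not reproduced on this page, so the form below (an exponential decay multiplied by a half-cosine gate that reaches zero at the cutoff) is an assumption, as are the names `age_weight`, `gated_adam_step`, `decay_rate`, and `cutoff` and their default values. It is an illustration of the general technique, not the authors' implementation.

```python
import math


def age_weight(age, decay_rate=0.1, cutoff=8.0):
    """Assumed age weight: exp(-decay_rate * age) times a half-cosine gate.

    The gate is 1 at age 0 and falls smoothly to 0 at `cutoff`;
    anything older than the cutoff contributes nothing.
    """
    if age >= cutoff:
        return 0.0
    gate = 0.5 * (1.0 + math.cos(math.pi * age / cutoff))
    return math.exp(-decay_rate * age) * gate


def gated_adam_step(params, pseudo_grad, age, state,
                    lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a pseudo-gradient that is scaled by its age weight
    before it touches the first- and second-moment buffers.

    Assumes the pseudo-gradient points in the ascent direction, so the
    update subtracts it. The step counter advances even for fully gated
    updates, which is a simplification made for this sketch.
    """
    w = age_weight(age)
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, pseudo_grad)):
        g = w * g  # age scaling happens before the moment updates
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        m_hat = state["m"][i] / (1.0 - beta1 ** t)  # bias correction
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


def new_state(n):
    """Fresh Adam state for n parameters."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n}
```

Under these assumptions, a pseudo-gradient with age 0 receives weight 1 and the step is an ordinary Adam step, while one older than `cutoff` leaves both the moment buffers and the parameters unchanged.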
