Logarithmic-time Schedules for Scaling Language Models with Momentum
Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, Elliot Paquette

TL;DR
This paper introduces ADANA, a new optimizer with logarithmic-time schedules for hyperparameters, which improves training efficiency and scalability of large language models by leveraging the data's power-law structure.
Contribution
The paper proposes a novel logarithmic-time scheduling method for optimizer hyperparameters, coupled with damping, resulting in a scalable optimizer that enhances large language model training.
Findings
ADANA achieves up to 40% better compute efficiency than AdamW.
The benefits of logarithmic-time scheduling increase with model scale.
Weight-decay scheduling alone improves training performance.
Abstract
In practice, the hyperparameters and weight decay in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for these quantities that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B…
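To make the idea of a memory horizon that grows with training time concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the schedule used by ADANA: the functional form `beta_t = 1 - delta / (delta + t)`, the function names, and the constant `delta` are all hypothetical choices. It relies only on the standard fact that an exponential moving average with coefficient `beta` averages over roughly `1 / (1 - beta)` past steps.

```python
def beta_schedule(step: int, delta: float = 8.0) -> float:
    """Hypothetical time-varying momentum coefficient (an assumption, not
    taken from the paper): beta_t = 1 - delta / (delta + t)."""
    return 1.0 - delta / (delta + step)


def memory_horizon(beta: float) -> float:
    """Effective averaging window of an EMA with coefficient beta."""
    return 1.0 / (1.0 - beta)


def horizons(steps, delta: float = 8.0):
    """Memory horizon at each requested step under the assumed schedule."""
    return [memory_horizon(beta_schedule(t, delta)) for t in steps]


if __name__ == "__main__":
    # With a fixed beta the horizon is constant; under the assumed schedule
    # it equals 1 + t / delta, so it grows in proportion to elapsed steps.
    for t in (10, 100, 1000, 10000):
        b = beta_schedule(t)
        print(f"step={t:>6}  beta={b:.6f}  horizon={memory_horizon(b):.1f}")
```

Under this assumed form, the averaging window is a fixed fraction of the steps taken so far, which is one way to read "logarithmic-time": the window has roughly constant width when time is measured on a log scale. The damping that the abstract describes as necessary for stability is not modeled here.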
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques
