Logarithmic-time Schedules for Scaling Language Models with Momentum

Damien Ferbach; Courtney Paquette; Gauthier Gidel; Katie Everett; Elliot Paquette

arXiv:2602.05298 · stat.ML · February 19, 2026

Open Access

TL;DR

This paper introduces ADANA, a new optimizer with logarithmic-time schedules for hyperparameters, which improves training efficiency and scalability of large language models by leveraging the data's power-law structure.

Contribution

The paper proposes logarithmic-time schedules for the optimizer hyperparameters $(β_{1}, β_{2}, λ)$, coupled with explicit damping for stability. The result is ADANA, an AdamW-like optimizer aimed at large-scale language model training.

Findings

01

ADANA achieves up to a 40% compute-efficiency gain over AdamW.

02

Logarithmic-time scheduling benefits increase with model scale.

03

Weight-decay scheduling alone improves training performance.

Abstract

In practice, the hyperparameters $(β_{1}, β_{2})$ and weight-decay $λ$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, the answer is yes: by exploiting the power-law structure of language data, one can design time-varying schedules for $(β_{1}, β_{2}, λ)$ that deliver substantial performance gains. We study logarithmic-time scheduling, in which the optimizer's gradient memory horizon grows with training time. Although naive variants of this are unstable, we show that suitable damping mechanisms restore stability while preserving the benefits of longer memory. Based on this, we present ADANA, an AdamW-like optimizer that couples log-time schedules with explicit damping to balance stability and performance. We empirically evaluate ADANA across transformer scalings (45M to 2.6B…
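To make the idea of a memory horizon that grows with training time concrete, here is a minimal sketch. It is not ADANA's actual schedule: the functional form $β_t = 1 - δ/(δ + t)$, the parameter `delta`, and the helper names `log_time_beta` and `memory_horizon` are assumptions chosen for illustration, and the damping mechanism described in the abstract is not modeled.

```python
def log_time_beta(step: int, delta: float = 8.0) -> float:
    """Hypothetical time-varying momentum coefficient.

    Assumed form (not taken from the paper): beta_t = 1 - delta / (delta + t).
    It starts at 0 and approaches 1 as training proceeds.
    """
    return 1.0 - delta / (delta + step)


def memory_horizon(beta: float) -> float:
    """Effective averaging window of an exponential moving average
    with coefficient beta, i.e. 1 / (1 - beta)."""
    return 1.0 / (1.0 - beta)


if __name__ == "__main__":
    fixed_beta = 0.9  # a commonly used constant beta_1, for comparison
    for step in (0, 10, 100, 1_000, 10_000):
        beta = log_time_beta(step)
        print(
            f"step={step:>6}  beta={beta:.5f}  "
            f"horizon={memory_horizon(beta):.1f}  "
            f"(fixed beta horizon={memory_horizon(fixed_beta):.1f})"
        )
```

Under this assumed form the horizon equals $(δ + t)/δ$, so it grows in proportion to the number of steps taken, whereas a fixed $β_{1}$ keeps the horizon constant for the whole run.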

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Natural Language Processing Techniques