Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay
Binghui Li, Zilin Wang, Fengling Chen, Shiyang Zhao, Ruiheng Zheng, Lei Wu

TL;DR
This paper derives optimal learning-rate schedules under the functional scaling law framework, revealing a phase transition between power decay and warmup-stable-decay regimes, with implications for training efficiency and theoretical guarantees.
Contribution
The paper derives optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework, characterizes the phase transition between the two regimes, and evaluates practical schedules such as cosine decay against this theory.
Findings
A power-decay schedule is optimal in the easy-task regime
A warmup-stable-decay schedule is optimal in the hard-task regime
Last iterate achieves minimax-optimal rate in kernel regression
Abstract
We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent controlling the rate of signal learning, and a capacity exponent determining the rate of noise forgetting. Focusing on a fixed training horizon, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime, the optimal schedule follows a power decay to zero, where the peak learning rate scales according to an explicit exponent. In contrast, in the hard-task regime, the optimal LRS…
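To make the two schedule families named in the abstract concrete, here is a minimal Python sketch. It is illustrative only: the functional form `peak_lr * (1 - step/total_steps) ** power`, the linear warmup and linear decay, and every parameter name and default (`power`, `warmup_frac`, `decay_frac`) are assumptions chosen for this example. They are not the exponents or formulas derived in the paper, which are not reproduced in this listing.

```python
def power_decay_lr(step, total_steps, peak_lr, power):
    """Power decay to zero over a fixed horizon (assumed form, hypothetical exponent)."""
    # Clamp training progress to [0, 1] so out-of-range steps stay well defined.
    frac = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * (1.0 - frac) ** power


def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay: linear warmup, constant plateau, linear decay to zero.

    The phase fractions and the linear ramp shapes are placeholder assumptions.
    """
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = max(int(decay_frac * total_steps), 1)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps.
    return peak_lr * max(total_steps - step, 0) / decay_steps


if __name__ == "__main__":
    horizon = 1000
    for s in (0, 25, 400, 900, 1000):
        print(s, power_decay_lr(s, horizon, 0.1, 2.0), wsd_lr(s, horizon, 0.1))
```

Both functions take the horizon as an explicit argument because the abstract frames the problem for a fixed training horizon; the schedule is only defined relative to that endpoint.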
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Quantum many-body systems
