Optimal Learning-Rate Schedules under Functional Scaling Laws: Power Decay and Warmup-Stable-Decay

Binghui Li; Zilin Wang; Fengling Chen; Shiyang Zhao; Ruiheng Zheng; Lei Wu

arXiv:2602.06797·stat.ML·February 17, 2026

TL;DR

This paper derives optimal learning-rate schedules for a fixed training horizon under the functional scaling law (FSL) framework. It reveals a sharp phase transition: power decay is optimal in the easy-task regime, and warmup-stable-decay is optimal in the hard-task regime.

Contribution

The paper rigorously derives optimal learning-rate schedules (LRSs) under the FSL framework, characterizes the phase transition between the easy-task and hard-task regimes, and evaluates practical schedules such as cosine decay within the same theoretical framework.

Findings

1. A power-decay schedule is optimal in the easy-task regime ($s \geq 1 - 1/\beta$).

2. A warmup-stable-decay schedule is optimal in the hard-task regime ($s < 1 - 1/\beta$).

3. The last iterate achieves the minimax-optimal rate in kernel regression.

Abstract

We study optimal learning-rate schedules (LRSs) under the functional scaling law (FSL) framework introduced in Li et al. (2025), which accurately models the loss dynamics of both linear regression and large language model (LLM) pre-training. Within FSL, loss dynamics are governed by two exponents: a source exponent $s > 0$ controlling the rate of signal learning, and a capacity exponent $\beta > 1$ determining the rate of noise forgetting. Focusing on a fixed training horizon $N$, we derive the optimal LRSs and reveal a sharp phase transition. In the easy-task regime $s \geq 1 - 1/\beta$, the optimal schedule follows a power decay to zero, $\eta^{*}(z) = \eta_{\mathrm{peak}} (1 - z/N)^{2\beta - 1}$, where the peak learning rate scales as $\eta_{\mathrm{peak}} \eqsim N^{-\nu}$ for an explicit exponent $\nu = \nu(s, \beta)$. In contrast, in the hard-task regime $s < 1 - 1/\beta$, the optimal LRS…
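
As an illustration of the easy-task formula quoted in the abstract, the following minimal Python sketch evaluates $\eta^{*}(z) = \eta_{\mathrm{peak}} (1 - z/N)^{2\beta - 1}$ and classifies the regime from $s$ and $\beta$. The function names `regime` and `power_decay_lr` are hypothetical and not taken from the paper. The sketch assumes $z$ is a progress variable ranging over $[0, N]$, and it takes `eta_peak` as an input because the abstract does not state the exponent $\nu(s, \beta)$. The hard-task schedule is not implemented because its form is not given in the text above.

```python
def regime(s: float, beta: float) -> str:
    """Classify the task regime from the source exponent s and capacity exponent beta.

    Uses the threshold stated in the abstract: easy if s >= 1 - 1/beta, hard otherwise.
    """
    if s <= 0 or beta <= 1:
        raise ValueError("requires s > 0 and beta > 1")
    return "easy" if s >= 1 - 1 / beta else "hard"


def power_decay_lr(z: float, horizon: float, eta_peak: float, beta: float) -> float:
    """Power-decay schedule eta_peak * (1 - z/N)^(2*beta - 1) from the abstract.

    Assumption: z is a progress variable in [0, horizon]; eta_peak is supplied
    by the caller because the scaling exponent nu(s, beta) is not given here.
    """
    if beta <= 1:
        raise ValueError("requires beta > 1")
    if not 0 <= z <= horizon:
        raise ValueError("z must lie in [0, horizon]")
    return eta_peak * (1 - z / horizon) ** (2 * beta - 1)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    s, beta, horizon, eta_peak = 0.6, 2.0, 1000, 0.1
    print("regime:", regime(s, beta))
    for z in (0, 250, 500, 750, 1000):
        print(z, power_decay_lr(z, horizon, eta_peak, beta))
```

With $\beta = 2$ the decay exponent is $2\beta - 1 = 3$, so the schedule starts at `eta_peak` and falls to zero at $z = N$, with most of the decay concentrated early in training.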

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Quantum many-body systems