Learning Rate Engineering: From Coarse Single Parameter to Layered Evolution
Ming-Hong Yao, Di Wang, Jian Cui, Jin-Yan Chen, Zi-Hao Cui, Fa Wang, Chen Wei, Qiu-Ye Yu

TL;DR
This paper reviews the evolution of learning rate scheduling from simple global rates to layered, adaptive strategies, proposing a unified framework called DALS and benchmarking multiple approaches across diverse datasets.
Contribution
It systematizes the evolution of learning rate strategies into five generations and introduces DALS, a unified adaptive optimizer integrating multiple scheduling techniques.
Findings
DALS achieves 98.0% accuracy on synthetic tasks.
DALS-Fast reaches 90% accuracy in 3 epochs.
No single strategy outperforms the others across all regimes.
Abstract
Learning rate scheduling has evolved from the single global fixed rate of early SGD to sophisticated layer-wise adaptive strategies. We systematize this evolution into five generations: (Gen1) global fixed learning rates, (Gen2) global scheduling, (Gen3) parameter-level adaptation, (Gen4) layer-level differentiation, and (Gen5) joint layer-time scheduling. We trace the fundamental motivation behind each transition, showing how the shift from one-size-fits-all to tailoring by layer and time addresses the impossible trinity of transfer learning: lower layers require small updates to preserve general knowledge while higher layers need large updates to adapt to new tasks. Building on this taxonomy, we propose Discriminative Adaptive Layer Scaling (DALS), a unified framework that integrates phase-adaptive cosine scheduling, depth-aware Grokfast gradient filtering, and LARS-style trust ratios…
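To make the taxonomy concrete, here is a minimal sketch of what "joint layer-time scheduling" (Gen5) could look like: a per-layer scale factor (Gen4) multiplied by a global cosine schedule (Gen2). This is not the DALS method from the paper, and it omits the Grokfast filtering and LARS-style trust ratios the abstract mentions. The function names, the geometric decay rule, and the `decay=0.8` default are assumptions chosen for illustration only.

```python
import math


def layer_lr(base_lr, layer_idx, num_layers, decay=0.8):
    """Layer-level differentiation (illustrative, not the paper's rule).

    The top layer (layer_idx == num_layers - 1) gets base_lr; every layer
    below it is scaled down by another factor of `decay`, so lower layers
    take smaller steps and higher layers take larger ones.
    """
    return base_lr * decay ** (num_layers - 1 - layer_idx)


def cosine_factor(step, total_steps, min_factor=0.0):
    """Global cosine schedule: 1.0 at step 0, min_factor at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_factor + (1.0 - min_factor) * 0.5 * (1.0 + math.cos(math.pi * progress))


def joint_lr(base_lr, layer_idx, num_layers, step, total_steps, decay=0.8):
    """Joint layer-time rate: per-layer scale times the time-dependent factor."""
    return layer_lr(base_lr, layer_idx, num_layers, decay) * cosine_factor(step, total_steps)


if __name__ == "__main__":
    # Print the per-layer rates of a hypothetical 4-layer model at three points in training.
    for step in (0, 50, 100):
        rates = [joint_lr(1e-3, i, 4, step, 100) for i in range(4)]
        print(step, [f"{r:.2e}" for r in rates])
```

In a real training loop, each value from `joint_lr` would be assigned to the optimizer parameter group for the corresponding layer at every step. The multiplicative form keeps the layer ordering fixed over time, which is one simple design choice among many.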
