Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler
Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi

TL;DR
This paper analyzes the cooldown phase in the Warmup-Stable-Decay learning rate scheduler for transformer training, revealing how cooldown shape influences bias-variance trade-offs and model performance, with practical tuning insights.
Contribution
It provides the first comprehensive analysis of the cooldown phase in WSD learning rate scheduling, highlighting its impact on bias-variance trade-offs and offering practical tuning recommendations.
Findings
Cooldown shape affects bias-variance trade-off and model performance.
Higher AdamW $\beta_2$ values during cooldown improve results.
Loss landscape visualizations support the river valley view of the loss landscape during cooldown.
Abstract
Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis reveals that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations comparable to those from cooldown shape selection when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a…
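For readers unfamiliar with WSD, the schedule discussed in the abstract can be sketched as a function of the training step: a warmup ramp, a constant plateau at the peak learning rate, and a final cooldown to zero. The sketch below is illustrative only and is not the authors' code. The function name `wsd_lr`, its parameters, and the three cooldown shapes shown (linear, cosine, and a square-root decay) are assumptions chosen for illustration; the paper's exact shapes and settings may differ.

```python
import math


def wsd_lr(step, total_steps, peak_lr, warmup_steps, cooldown_steps, shape="linear"):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule.

    Hypothetical helper: names, defaults, and shapes are illustrative
    assumptions, not taken from the paper.
    """
    cooldown_start = total_steps - cooldown_steps

    # Warmup: linear ramp from peak_lr / warmup_steps up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable: hold the peak learning rate.
    if step < cooldown_start:
        return peak_lr

    # Cooldown: t runs from 0 (start of cooldown) to 1 (end of training).
    t = min(1.0, (step - cooldown_start) / cooldown_steps)
    if shape == "linear":
        factor = 1.0 - t
    elif shape == "cosine":
        factor = 0.5 * (1.0 + math.cos(math.pi * t))
    elif shape == "sqrt":
        # Drops quickly at first, then flattens out.
        factor = 1.0 - math.sqrt(t)
    else:
        raise ValueError(f"unknown cooldown shape: {shape}")
    return peak_lr * factor


if __name__ == "__main__":
    # Compare the shapes halfway through a 200-step cooldown.
    for shape in ("linear", "cosine", "sqrt"):
        print(shape, wsd_lr(900, 1000, 1e-3, 100, 200, shape))
```

All three shapes start at the peak learning rate and end at zero; they differ only in how quickly the learning rate falls in between, which is the degree of freedom the abstract refers to as cooldown shape.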
