Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

Aleksandr Dremov; Alexander Hägele; Atli Kosson; Martin Jaggi

arXiv:2508.01483·cs.LG·August 8, 2025

TL;DR

This paper analyzes the cooldown phase in the Warmup-Stable-Decay learning rate scheduler for transformer training, revealing how cooldown shape influences bias-variance trade-offs and model performance, with practical tuning insights.

Contribution

The paper provides a comprehensive analysis of the cooldown phase in WSD learning rate scheduling, showing how it affects the bias-variance trade-off and offering practical tuning recommendations.

Findings

01

Cooldown shape affects bias-variance trade-off and model performance.

02

Higher AdamW $\beta_2$ values during cooldown consistently improve results.

03

Visualizations support the river valley picture of the loss landscape during cooldown.

Abstract

Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in getting the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase in the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes reveal a fundamental bias-variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations, comparable to those from cooldown shape selection, when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a…
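For readers unfamiliar with the schedule the abstract refers to, the following is a minimal sketch of a WSD learning rate with a configurable cooldown shape. The function name `wsd_lr`, its parameters, and the three shapes (`linear`, `sqrt`, `cosine`) are illustrative assumptions, not the authors' implementation or the set of shapes evaluated in the paper.

```python
import math


def wsd_lr(step, total_steps, peak_lr, warmup_steps, cooldown_steps,
           shape="linear", final_lr=0.0):
    """Learning rate at a 0-indexed `step` of a warmup-stable-decay schedule.

    Hypothetical helper for illustration; names and shapes are assumptions.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    cooldown_start = total_steps - cooldown_steps

    # Warmup: ramp linearly up to peak_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable: hold the peak learning rate.
    if step < cooldown_start:
        return peak_lr

    # Cooldown: t runs from 1/cooldown_steps up to 1 at the final step.
    t = (step - cooldown_start + 1) / cooldown_steps
    if shape == "linear":
        factor = 1.0 - t
    elif shape == "sqrt":
        # Drops quickly at first, then flattens out.
        factor = 1.0 - math.sqrt(t)
    elif shape == "cosine":
        # Drops slowly at first, then faster, then flattens near the end.
        factor = 0.5 * (1.0 + math.cos(math.pi * t))
    else:
        raise ValueError(f"unknown cooldown shape: {shape!r}")
    return final_lr + (peak_lr - final_lr) * factor
```

Called with `peak_lr=1.0`, the function returns a multiplicative factor in [0, 1], which is the form that lambda-style schedulers such as PyTorch's `LambdaLR` expect. Comparing shapes at the same step shows how they differ in how long the learning rate stays high during cooldown.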
