
TL;DR
This paper investigates how decoupled weight decay should be set in optimizers such as AdamW. It challenges the convention of scaling the decay in proportion to the learning rate and argues that scaling it with the square of the learning rate keeps weight norms stable during training.
Contribution
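The paper derives that decoupled weight decay should be proportional to the square of the learning rate (γ²) for the weight norm to remain stable, assuming only that updates become independent of the weights at steady state. The derivation is supported by empirical verification.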
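A minimal sketch of why this scaling arises, under stated assumptions. The notation is introduced here for illustration and is not taken from the paper: $\eta$ is the total fraction by which the weights shrink per step, $\gamma$ is the learning rate, and $u_t$ is the optimizer's update direction, assumed uncorrelated with the weights at steady state. The paper's own derivation may differ in detail.

```latex
\begin{align*}
w_{t+1} &= (1-\eta)\,w_t - \gamma\,u_t \\
\mathbb{E}\|w_{t+1}\|^2
  &= (1-\eta)^2\,\mathbb{E}\|w_t\|^2
   - 2\gamma(1-\eta)\,\mathbb{E}\langle w_t, u_t\rangle
   + \gamma^2\,\mathbb{E}\|u_t\|^2 \\
  &= (1-\eta)^2\,\mathbb{E}\|w_t\|^2 + \gamma^2\,\mathbb{E}\|u_t\|^2
  \qquad \text{(assuming } \mathbb{E}\langle w_t, u_t\rangle = 0\text{)} \\
\mathbb{E}\|w_\infty\|^2
  &= \frac{\gamma^2\,\mathbb{E}\|u\|^2}{2\eta - \eta^2}
  \approx \frac{\gamma^2\,\mathbb{E}\|u\|^2}{2\eta}
  \qquad (\eta \ll 1)
\end{align*}
```

If the update magnitude $\mathbb{E}\|u\|^2$ does not itself depend on $\gamma$, the steady-state weight norm is independent of the learning rate exactly when $\eta \propto \gamma^2$. With $\eta \propto \gamma$, the squared norm instead grows linearly in $\gamma$.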
Findings
Decoupled weight decay proportional to γ² stabilizes weight norms.
Stable weight and gradient norms improve training dynamics.
Optimal effective learning rate transfers across different settings.
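The first finding can be checked on a toy model. The sketch below is not the paper's experiment. It assumes a hypothetical update rule `w <- (1 - decay) * w - lr * u`, where `u` is random noise of roughly unit norm drawn independently of `w`, standing in for the steady-state assumption. The names `steady_state_norm` and `predicted_norm` are illustrative.

```python
import numpy as np


def steady_state_norm(lr, decay, dim=256, steps=20000, seed=0):
    """Simulate w <- (1 - decay) * w - lr * u and return the RMS weight norm.

    `u` is drawn independently of `w` with E||u||^2 = 1 (an assumption of
    this toy model). The norm is averaged over the second half of the run.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    sq_norms = []
    for t in range(steps):
        u = rng.standard_normal(dim) / np.sqrt(dim)  # E||u||^2 = 1
        w = (1.0 - decay) * w - lr * u
        if t >= steps // 2:  # discard the transient
            sq_norms.append(w @ w)
    return float(np.sqrt(np.mean(sq_norms)))


def predicted_norm(lr, decay):
    """Steady-state norm of the toy model: lr / sqrt(2*decay - decay**2)."""
    return lr / np.sqrt(2.0 * decay - decay**2)


if __name__ == "__main__":
    for lr in (0.01, 0.02, 0.04):
        quad = steady_state_norm(lr, decay=10.0 * lr**2)  # decay ∝ lr^2
        lin = steady_state_norm(lr, decay=0.1 * lr)       # decay ∝ lr
        print(f"lr={lr:.2f}  norm(decay∝lr²)={quad:.3f}  norm(decay∝lr)={lin:.3f}")
```

In this toy setting the norm stays nearly constant across learning rates when the decay scales with `lr**2`, and grows with the learning rate when the decay scales with `lr`. Real optimizers produce updates whose magnitude and correlation with the weights vary during training, so this illustrates only the steady-state argument.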
Abstract
Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate without question. Some researchers have recently challenged this assumption and argued that decoupled weight decay should instead be set based on orthogonality arguments at steady state. On the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change in the training dynamics. Instead, we derive that decoupled weight decay ∝ γ² results in a stable weight norm, based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the…
