How to set AdamW's weight decay as you scale model and dataset size
Xi Wang, Laurence Aitchison

TL;DR
This paper investigates how the AdamW weight decay hyperparameter should scale with model and dataset size. Viewing AdamW's weights as an exponential moving average (EMA) of recent updates, it finds that weight decay should increase with model size and decrease with dataset size.
Contribution
It introduces an EMA-based perspective on AdamW weight decay, providing practical scaling rules validated across multiple models and datasets.
Findings
The optimal EMA timescale, measured in epochs, is roughly constant as model and dataset size change.
Weight decay should increase with model size and decrease with dataset size.
Scaling of weight decay is crucial for effective training as models and datasets grow.
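The findings above can be illustrated with a minimal Python sketch. It assumes the common decoupled-decay parameterization in which each step multiplies the weights by (1 - lr * weight_decay), so that the EMA timescale is 1 / (lr * weight_decay) iterations. The function names, the fixed batch size, and all numbers are hypothetical and chosen only for illustration; the summary does not say how the learning rate is chosen for larger models, so the lowered learning rate in the last case is also an assumption.

```python
def timescale_epochs(lr: float, weight_decay: float, iters_per_epoch: int) -> float:
    """EMA timescale in epochs, assuming per-step shrinkage of (1 - lr * weight_decay)."""
    tau_iters = 1.0 / (lr * weight_decay)  # timescale measured in iterations
    return tau_iters / iters_per_epoch     # convert iterations to epochs


def weight_decay_for_timescale(lr: float, tau_epochs: float, iters_per_epoch: int) -> float:
    """Invert the mapping: the weight decay that gives a target timescale in epochs."""
    return 1.0 / (lr * tau_epochs * iters_per_epoch)


# Hypothetical baseline: 1000 iterations per epoch, target timescale of 1 epoch.
wd_base = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, iters_per_epoch=1000)

# Twice the data at a fixed batch size means twice the iterations per epoch,
# so holding the timescale fixed halves the weight decay.
wd_more_data = weight_decay_for_timescale(lr=1e-3, tau_epochs=1.0, iters_per_epoch=2000)

# If a larger model is trained with half the learning rate (an assumption),
# holding the timescale fixed doubles the weight decay.
wd_lower_lr = weight_decay_for_timescale(lr=5e-4, tau_epochs=1.0, iters_per_epoch=1000)

print(wd_base, wd_more_data, wd_lower_lr)
```

Under these assumptions the weight decay moves inversely with both the learning rate and the number of iterations per epoch, which is one way to read the two scaling directions listed above.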
Abstract
The scaling of the optimal AdamW weight decay hyperparameter with model and dataset size is critical as we seek to build larger models, but is poorly understood. We show that weights learned by AdamW can be understood as an exponential moving average (EMA) of recent updates. This gives critical insights for how to set the weight decay in AdamW, and how the weight decay should scale with model and dataset size. In particular, the key hyperparameter for an exponential moving average is the EMA timescale. Intuitively, the EMA timescale can be understood as the number of recent iterations the EMA averages over. We find that the optimal timescale, measured in epochs, is roughly constant as we change model and dataset size. Moreover, given a learning rate, there is a one-to-one mapping from the EMA timescale to the weight decay hyperparameter. Thus, if the optimal EMA timescale is constant,…
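The EMA view described in the abstract can be sketched as a short derivation. It assumes the common decoupled-decay form of the AdamW step; the symbols (learning rate \eta, weight decay \lambda, Adam update direction u_t, iterations per epoch M, and the timescales \tau) are notation introduced here for illustration and are not taken from the abstract.

```latex
\begin{align*}
w_t &= (1 - \eta\lambda)\, w_{t-1} - \eta\, u_t
  && \text{(AdamW step with decoupled weight decay)} \\
    &= \Bigl(1 - \tfrac{1}{\tau_{\mathrm{iter}}}\Bigr)\, w_{t-1}
       + \tfrac{1}{\tau_{\mathrm{iter}}} \Bigl(-\tfrac{u_t}{\lambda}\Bigr),
  && \tau_{\mathrm{iter}} = \frac{1}{\eta\lambda}, \\
\tau_{\mathrm{epoch}} &= \frac{\tau_{\mathrm{iter}}}{M} = \frac{1}{\eta\lambda M}.
\end{align*}
```

The second line has the standard EMA form, with the weights averaging the quantities -u_t / \lambda over roughly \tau_{\mathrm{iter}} recent iterations. For a fixed \eta and M, each value of \tau_{\mathrm{epoch}} corresponds to exactly one \lambda.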
Taxonomy
Topics: Artificial Intelligence in Healthcare
Methods: Sparse Evolutionary Training · AdamW · Weight Decay · LLaMA
