Decoupled Weight Decay Regularization
Ilya Loshchilov, Frank Hutter

TL;DR
This paper introduces a decoupled weight decay method for adaptive gradient algorithms like Adam, improving their regularization and generalization performance by separating weight decay from the gradient update process.
Contribution
It proposes a simple modification to standard Adam and SGD algorithms to decouple weight decay from gradient updates, enhancing regularization and generalization.
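The modification can be written out as update rules. The following is a sketch in simplified notation, not the paper's exact algorithm listing: $\alpha$ is the learning rate, $\lambda'$ the L2 coefficient, $\lambda$ the decoupled per-step decay rate, and $\hat{m}_t$, $\hat{v}_t$ Adam's bias-corrected moment estimates. Any learning-rate schedule multiplier is omitted here for brevity, which is an assumption of this sketch.

```latex
% SGD with L2 regularization: the penalty gradient \lambda' \theta_t is added to the loss gradient.
\theta_{t+1} = \theta_t - \alpha \left( \nabla f_t(\theta_t) + \lambda' \theta_t \right)
             = (1 - \alpha \lambda')\, \theta_t - \alpha \nabla f_t(\theta_t)

% SGD with weight decay: identical to the line above when \lambda = \alpha \lambda'.
\theta_{t+1} = (1 - \lambda)\, \theta_t - \alpha \nabla f_t(\theta_t)

% Adam with decoupled weight decay: \hat{m}_t and \hat{v}_t are built from
% \nabla f_t(\theta_t) only, so the decay term is not rescaled by 1 / \sqrt{\hat{v}_t}.
\theta_{t+1} = \theta_t - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \lambda \theta_t
```

With L2 regularization in Adam, the term $\lambda' \theta_t$ would instead enter $\hat{m}_t$ and $\hat{v}_t$ and be divided by $\sqrt{\hat{v}_t}$ along with the loss gradient, which is why the two forms stop being equivalent.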
Findings
Decoupled weight decay substantially improves Adam's generalization performance on image classification.
Decoupling allows independent tuning of weight decay and learning rate.
The method (AdamW) is widely adopted and is implemented in major ML frameworks, for example as torch.optim.AdamW in PyTorch.
Abstract
L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ L2 regularization (often calling it "weight decay" in what may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with…
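The equivalence for SGD and the inequivalence for Adam can be checked numerically. The following is a minimal NumPy sketch, not the authors' reference implementation: the function names, the toy parameter and gradient values, and the parameterization of `wd` as a per-step decay fraction (with no schedule multiplier) are all assumptions made for illustration.

```python
import numpy as np

def sgd_l2_step(theta, grad, lr, l2):
    # L2 regularization: the penalty gradient l2 * theta is added to the loss gradient.
    return theta - lr * (grad + l2 * theta)

def sgd_decoupled_step(theta, grad, lr, wd):
    # Weight decay: shrink the weights directly, then apply the loss gradient step.
    return (1.0 - wd) * theta - lr * grad

def adam_step(theta, grad, m, v, t, lr, l2=0.0, wd=0.0,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # l2 > 0 gives Adam with L2 regularization (penalty folded into the gradient).
    # wd > 0 gives decoupled weight decay (applied outside the adaptive rescaling).
    g = grad + l2 * theta
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - wd * theta
    return theta, m, v

# Toy values, chosen only for illustration.
theta = np.array([1.0, -2.0, 0.5])
grad = np.array([0.1, 0.1, 0.1])
lr, l2 = 0.1, 0.01
wd = lr * l2  # rescaling by the learning rate

# SGD: the two formulations produce the same step.
sgd_a = sgd_l2_step(theta, grad, lr, l2)
sgd_b = sgd_decoupled_step(theta, grad, lr, wd)
print("SGD identical:", np.allclose(sgd_a, sgd_b))

# Adam: the same rescaling does not make the steps match.
zeros = np.zeros_like(theta)
adam_l2, _, _ = adam_step(theta, grad, zeros, zeros, t=1, lr=lr, l2=l2)
adam_w, _, _ = adam_step(theta, grad, zeros, zeros, t=1, lr=lr, wd=wd)
print("Adam identical:", np.allclose(adam_l2, adam_w))
```

In this toy setup, the first Adam step with L2 regularization moves every coordinate by almost exactly `lr`, because the penalty is normalized away together with the gradient. The decoupled version still shrinks each weight in proportion to its magnitude.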
Taxonomy
Topics: Sparse and Compressive Sensing Techniques
Methods: SGDW · AdamW · SGD with Momentum · Weight Decay · Adam · Stochastic Gradient Descent
