Understanding Decoupled and Early Weight Decay
Johan Bjorck, Kilian Weinberger, Carla Gomes

TL;DR
This paper investigates the effects of weight decay in deep learning, especially when applied only at the start or decoupled from the loss, revealing insights into regularization, generalization, and optimizer behavior across vision, NLP, and RL tasks.
Contribution
It provides a comprehensive analysis of decoupled and early weight decay, demonstrating their impact on network norm, regularization, and optimizer dynamics across different domains.
Findings
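Applying WD only at the start keeps the network norm small throughout training, which has a regularizing effect because the effective gradient updates become larger.
Decoupled WD keeps the weight decay term from mixing with the loss gradients inside Adam, which makes hyperparameter tuning easier.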
Traditional generalization metrics fail to capture WD effects, but scale-invariant metrics succeed.
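The link between a small norm and larger effective updates can be illustrated with a toy example. This is a minimal sketch, not the paper's experiment: it assumes the loss is scale-invariant in the weights (as happens when weights feed a normalization layer), and the names `early_wd`, `loss`, and `num_grad` are hypothetical helpers introduced only for illustration.

```python
import math

def early_wd(step, cutoff_step, wd):
    # Hypothetical schedule: weight decay is active only for the
    # first cutoff_step steps and switched off afterwards.
    return wd if step < cutoff_step else 0.0

def loss(w):
    # Toy scale-invariant loss: it depends only on the direction w / ||w||.
    norm = math.sqrt(sum(x * x for x in w))
    u = [x / norm for x in w]
    target = [1.0, 0.0]
    return sum((a - b) ** 2 for a, b in zip(u, target))

def num_grad(f, w, eps=1e-6):
    # Central finite differences; adequate for a toy check.
    grad = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def l2norm(v):
    return math.sqrt(sum(x * x for x in v))

w = [0.6, 0.8]
g_small = num_grad(loss, w)                    # gradient at norm 1
g_large = num_grad(loss, [2 * x for x in w])   # same direction, norm 2
ratio = l2norm(g_small) / l2norm(g_large)      # about 2 for a scale-invariant loss
```

Under this assumption, doubling the weight norm halves the gradient, so with a fixed learning rate the same step rotates the weight direction less. Keeping the norm small early on therefore acts like a larger effective step size later.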
Abstract
Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of training in computer vision, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an L2 penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this paper is to investigate these two recent empirical observations. We demonstrate that by applying WD only at the start, the network norm stays small throughout training. This has a regularizing effect as the effective gradient updates become larger. However, traditional generalization metrics fail to capture this effect of WD, and we show how a simple scale-invariant metric…
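The difference between an L2 penalty and decoupled WD under Adam can be sketched for a single scalar weight. This is an illustrative sketch under stated assumptions rather than the authors' code: `adam_step` and `new_state` are hypothetical names, and the decoupled update follows the common convention of subtracting `lr * wd * w` outside the adaptive step.

```python
import math

def new_state():
    # Adam's step counter and first/second moment buffers for one scalar weight.
    return {"t": 0, "m": 0.0, "v": 0.0}

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.0, decoupled=False):
    """One scalar Adam update with coupled L2 or decoupled weight decay."""
    if not decoupled:
        # Coupled: the penalty's gradient wd * w enters the moment buffers
        # together with the loss gradient.
        grad = grad + wd * w
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled: shrink the weight directly, bypassing the buffers
        # and the adaptive rescaling.
        w_new -= lr * wd * w
    return w_new

# With a zero loss gradient, only the decay term moves the weight.
w0, lr, wd = 10.0, 0.1, 0.01
w_l2 = adam_step(w0, 0.0, new_state(), lr=lr, wd=wd, decoupled=False)
w_dec = adam_step(w0, 0.0, new_state(), lr=lr, wd=wd, decoupled=True)
```

In this toy case the coupled variant moves the weight by roughly `lr` (to about 9.9) because Adam normalizes the penalty gradient, while the decoupled variant moves it by `lr * wd * w` (to 9.99). The decay strength in the coupled form is thus entangled with Adam's rescaling, which is one way to see why the two forms need different tuning.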
Taxonomy
Topics: Human Pose and Action Recognition · Machine Learning and Data Classification · Advanced Neural Network Applications
Methods: Adam
