Why Do We Need Weight Decay in Modern Deep Learning?
Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion

TL;DR
This paper investigates the role of weight decay in modern deep learning, revealing it primarily influences training dynamics rather than acting as a regularizer, with effects varying across vision and language models.
Contribution
It provides a unifying perspective on weight decay, showing it modifies optimization dynamics rather than serving as explicit regularization in deep networks.
Findings
Weight decay enhances implicit regularization in vision models trained with SGD.
In large language models trained for nearly one epoch, weight decay balances the bias-variance tradeoff in stochastic optimization, lowering training loss and improving training stability.
Weight decay's primary role is in training dynamics, not explicit regularization.
Abstract
Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to large language models. Despite its widespread usage and being extensively studied in the classical literature, its role remains poorly understood for deep learning. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for large language models trained with nearly one-epoch training, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss and improved training stability.…
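For readers who want the mechanism in concrete terms, here is a minimal sketch of the textbook SGD update with weight decay. It is not code from the paper; the function names `sgd_step` and `shrink_then_step` and the hyperparameter values are illustrative assumptions. It shows that adding the decay term to the gradient is the same as shrinking the weights by a factor of (1 - lr * weight_decay) and then taking a plain gradient step, which is one way to see that the decay acts on every step of the optimization trajectory.

```python
def sgd_step(weights, grads, lr, weight_decay):
    """One plain SGD step with the decay term added to the gradient.

    Equivalent to minimizing loss + (weight_decay / 2) * ||w||^2.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


def shrink_then_step(weights, grads, lr, weight_decay):
    """The same update rewritten as: shrink the weights, then step."""
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * g for w, g in zip(weights, grads)]


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    w = [1.0, -2.0]
    g = [0.5, 0.5]
    print(sgd_step(w, g, lr=0.1, weight_decay=0.01))
    print(shrink_then_step(w, g, lr=0.1, weight_decay=0.01))
```

The equivalence holds for plain SGD as written here; for adaptive optimizers such as Adam, the two formulations generally differ.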
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Weight Decay · Stochastic Gradient Descent
