Why Do We Need Weight Decay in Modern Deep Learning?

Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion

arXiv:2310.04415 · cs.LG · November 6, 2024 · 6 cites

TL;DR

This paper investigates the role of weight decay in modern deep learning, revealing it primarily influences training dynamics rather than acting as a regularizer, with effects varying across vision and language models.

Contribution

It provides a unifying perspective on weight decay, showing it modifies optimization dynamics rather than serving as explicit regularization in deep networks.

Findings

01

For vision models trained with multipass SGD, weight decay enhances the implicit regularization of SGD through the loss stabilization mechanism.

02

For large language models trained for nearly one epoch, weight decay balances the bias-variance tradeoff in stochastic optimization, lowering training loss and improving training stability.

03

Weight decay's primary role is in training dynamics, not explicit regularization.

Abstract

Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to large language models. Despite its widespread usage and being extensively studied in the classical literature, its role remains poorly understood for deep learning. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For deep networks on vision tasks trained with multipass SGD, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for large language models trained with nearly one-epoch training, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss and improved training stability.…
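
For readers unfamiliar with the mechanics, the classical form of weight decay that the abstract contrasts against can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the paper or its repository: the function name `sgd_step_with_weight_decay` and all numeric values are hypothetical, and it shows only the L2-penalty form applied with plain SGD.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step on loss(w) + (weight_decay / 2) * ||w||^2.

    The L2 penalty adds weight_decay * w to the gradient, so the update is
    w <- (1 - lr * weight_decay) * w - lr * g.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# Illustrative values only (assumptions, not taken from the paper).
weights = [1.0, -2.0]
zero_grads = [0.0, 0.0]  # a zero loss gradient isolates the decay term
for _ in range(3):
    weights = sgd_step_with_weight_decay(
        weights, zero_grads, lr=0.1, weight_decay=0.5
    )
print(weights)  # each weight has been scaled by (1 - 0.1 * 0.5) ** 3
```

With a zero loss gradient, every step multiplies each weight by `1 - lr * weight_decay`, which makes the shrinkage explicit. Setting `weight_decay=0` recovers plain SGD.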

Code & Models

Repositories

tml-epfl/why-weight-decay
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Weight Decay · Stochastic Gradient Descent