On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Zeke Xie; Zhiqiang Xu; Jingzhao Zhang; Issei Sato; Masashi Sugiyama

arXiv:2011.11152 · cs.LG · August 19, 2024 · 6 cites

TL;DR

This paper shows that weight decay can cause large gradient norms at the end of training, which often indicates poor convergence, and proposes a dynamic scheduler, Scheduled Weight Decay (SWD), that adjusts the weight decay strength according to the gradient norm to mitigate the issue.

Contribution

It reveals overlooked pitfalls of weight decay related to gradient norms and introduces a practical scheduler to dynamically adjust weight decay during training.

Findings

01. SWD effectively reduces large gradient norms.

02. SWD outperforms constant weight decay in the paper's experiments.

03. Mitigating large gradient norms improves convergence and generalization.

Abstract

Weight decay is a simple yet powerful regularization technique that is widely used in training deep neural networks (DNNs). Although weight decay has attracted much attention, previous studies have overlooked pitfalls related to the large gradient norms that result from weight decay. In this paper, we find that weight decay can lead to large gradient norms in the final phase of training (or at the terminated solution), which often indicates bad convergence and poor generalization. To mitigate these gradient-norm-centered pitfalls, we present the first practical scheduler for weight decay, the Scheduled Weight Decay (SWD) method, which dynamically adjusts the weight decay strength according to the gradient norm and significantly penalizes large gradient norms during training. Our experiments also support that SWD indeed mitigates large gradient norms and…
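To make the gradient-norm argument concrete, here is a minimal toy sketch in plain Python. It is not the authors' SWD implementation: the quadratic loss, the hyperparameters, and the function `norm_aware_decay` are all assumptions chosen for illustration, and the exact SWD rule is not given in this summary. The sketch relies on one standard fact: gradient descent with a constant weight decay `lam` settles where the loss gradient equals `-lam * theta`, so the loss gradient norm does not go to zero. It then compares that against a hypothetical schedule that shrinks the decay when the gradient norm is large.

```python
import math

def loss_grad(theta, a, c):
    """Gradient of the toy loss 0.5 * sum(a_i * (theta_i - c_i)**2)."""
    return [ai * (ti - ci) for ti, ai, ci in zip(theta, a, c)]

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def constant_decay(base_decay, grad_norm):
    """Baseline: the decay strength ignores the gradient norm."""
    return base_decay

def norm_aware_decay(base_decay, grad_norm, ref_norm=0.01):
    """Hypothetical schedule (an assumption, not the paper's formula):
    the decay strength shrinks as the gradient norm grows."""
    return base_decay / (1.0 + grad_norm / ref_norm)

def train(decay_schedule, base_decay=0.1, lr=0.05, steps=2000):
    """Gradient descent with decoupled weight decay on a 2-D quadratic.
    Returns the loss gradient norm at the final iterate."""
    a, c = [1.0, 4.0], [1.0, -2.0]   # toy curvatures and minimizer
    theta = [0.0, 0.0]
    for _ in range(steps):
        g = loss_grad(theta, a, c)
        decay = decay_schedule(base_decay, l2_norm(g))
        # Loss step plus a separate shrinkage step toward zero.
        theta = [t - lr * gi - lr * decay * t for t, gi in zip(theta, g)]
    return l2_norm(loss_grad(theta, a, c))

final_norm_constant = train(constant_decay)
final_norm_scheduled = train(norm_aware_decay)
print("constant decay, final loss-gradient norm:", final_norm_constant)
print("norm-aware decay, final loss-gradient norm:", final_norm_scheduled)
```

In this toy setting the constant schedule stalls at a clearly nonzero gradient norm, while the norm-aware schedule ends at a smaller one. This only illustrates the mechanism described in the abstract; it says nothing about the paper's reported results on real networks.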

Code & Models

Repositories

zeke-xie/stable-weight-decay-regularization
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM