When Will Gradient Regularization Be Harmful?

Yang Zhao; Hao Zhang; Xiuyuan Hu

arXiv:2406.09723 · cs.LG · June 17, 2024

TL;DR

Gradient regularization can cause instability in adaptive optimizers during early training, but warmup strategies can mitigate these issues and improve model performance, especially in scalable models.

Contribution

This paper identifies the instability caused by gradient regularization in adaptive optimizers and proposes warmup strategies to improve training stability and performance.
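To make "instability in adaptive optimizers" concrete, the block below writes out a GR objective next to the standard Adam moment updates. It is a sketch under assumed notation, not the paper's own formulation: the symbols $L$, $\lambda$, $g_t$, $\eta_t$ and the choice of the L2 norm are assumptions, and the Adam recursions are the textbook ones. Because $m_0 = v_0 = 0$, the first few $g_t$ dominate both running statistics, which is where a noisy regularized gradient would plausibly do the most damage.

```latex
% Assumed notation: L is the training loss, \lambda \ge 0 the GR coefficient,
% and the penalty uses the L2 norm (the summary above does not fix the norm).
\begin{align}
L_{\mathrm{GR}}(\theta) &= L(\theta) + \lambda \,\lVert \nabla_\theta L(\theta) \rVert_2, \\
g_t &= \nabla_\theta L_{\mathrm{GR}}(\theta_{t-1}), \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad m_0 = 0, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad v_0 = 0, \\
\theta_t &= \theta_{t-1} - \eta_t \,\frac{m_t/(1-\beta_1^{t})}{\sqrt{v_t/(1-\beta_2^{t})} + \epsilon}.
\end{align}
```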

Findings

01. Warmup strategies stabilize gradient regularization in adaptive optimizers.

02. Implementing warmup improves performance of Vision Transformers on CIFAR-10.

03. Scalable models benefit more from gradient regularization warmup.

Abstract

Gradient regularization (GR), which penalizes the gradient norm on top of the loss function, has shown promising results in training modern over-parameterized deep neural networks. However, can we trust this powerful technique? This paper reveals that GR can cause performance degeneration in adaptive optimization scenarios, particularly with learning rate warmup. Our empirical and theoretical analyses suggest this is because GR induces instability and divergence in the gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on the Vision Transformer family, we confirm that the three GR warmup strategies can effectively circumvent these issues,…
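The abstract does not spell out the three warmup strategies, so the following is only a minimal sketch of the general idea of relaxing the regularization early in training. Everything in it is an assumption made for illustration: the toy quadratic loss, the function names, the linear ramp of the coefficient, and the finite-difference approximation of the penalty gradient, which is a common way to avoid an explicit Hessian-vector product. It should not be read as the authors' implementation.

```python
import math


def loss_grad(theta, curvature):
    """Gradient of the toy loss L(theta) = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvature, theta)]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def gr_coefficient(step, warmup_steps, lam_max):
    """Hypothetical linear ramp: 0 at step 0, lam_max from warmup_steps onward."""
    if warmup_steps <= 0:
        return lam_max
    return lam_max * min(1.0, step / warmup_steps)


def gr_gradient(theta, curvature, lam, r=0.01, eps=1e-12):
    """Approximate gradient of L(theta) + lam * ||grad L(theta)||_2.

    Finite-difference form (an assumption, not taken from the abstract):
        (1 - lam/r) * g(theta) + (lam/r) * g(theta + r * g / ||g||)
    """
    g = loss_grad(theta, curvature)
    norm = l2_norm(g)
    if lam == 0.0 or norm < eps:
        # No penalty (or a vanishing gradient): fall back to the plain gradient.
        return g
    shifted = [t + r * gi / norm for t, gi in zip(theta, g)]
    g_shifted = loss_grad(shifted, curvature)
    alpha = lam / r
    return [(1.0 - alpha) * gi + alpha * gs for gi, gs in zip(g, g_shifted)]


if __name__ == "__main__":
    theta, curvature = [1.0, 2.0], [1.0, 3.0]
    for step in (0, 50, 100):
        lam = gr_coefficient(step, warmup_steps=100, lam_max=0.1)
        print(step, lam, gr_gradient(theta, curvature, lam))
```

With this ramp the update at step 0 is the plain gradient, so an adaptive optimizer's first moment estimates would be accumulated from unregularized gradients, and the penalty reaches full strength only once the ramp ends. For a quadratic loss the finite-difference form matches the exact penalty gradient; for a real network it is only an approximation whose quality depends on the assumed step size `r`.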

Code & Models

Repositories

zhaoyang-0204/gnp (JAX, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer