When Will Gradient Regularization Be Harmful?
Yang Zhao, Hao Zhang, Xiuyuan Hu

TL;DR
Gradient regularization can cause instability in adaptive optimizers during early training, but warmup strategies can mitigate these issues and improve model performance, especially in scalable models.
Contribution
This paper identifies the instability caused by gradient regularization in adaptive optimizers and proposes warmup strategies to improve training stability and performance.
Findings
Warmup strategies stabilize gradient regularization in adaptive optimizers.
GR warmup improves the performance of Vision Transformers on CIFAR-10.
Scalable models benefit more from gradient regularization warmup.
Abstract
Gradient regularization (GR), which aims to penalize the gradient norm atop the loss function, has shown promising results in training modern over-parameterized deep neural networks. However, can we trust this powerful technique? This paper reveals that GR can cause performance degeneration in adaptive optimization scenarios, particularly with learning rate warmup. Our empirical and theoretical analyses suggest this is due to GR inducing instability and divergence in gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on the Vision Transformer family, we confirm the three GR warmup strategies can effectively circumvent these issues,…
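To make the idea of relaxing GR during warmup concrete, here is a minimal sketch. It assumes the regularized objective is L(θ) + λ‖∇L(θ)‖₂ and uses a linear ramp of λ over the warmup steps. The ramp, the toy quadratic loss, and every name below (`loss_grad`, `gr_lambda`, `gr_gradient`, `curv`, `r`) are illustrative assumptions. They are not taken from the paper and are not necessarily one of its three strategies.

```python
import math

def loss_grad(theta, curv):
    """Gradient of the toy loss L(theta) = 0.5 * sum(c * t**2) (hypothetical)."""
    return [c * t for c, t in zip(curv, theta)]

def gr_lambda(step, warmup_steps, lam_max):
    """Hypothetical linear ramp: 0 at step 0, lam_max once warmup ends."""
    if warmup_steps <= 0:
        return lam_max
    return lam_max * min(1.0, step / warmup_steps)

def gr_gradient(theta, curv, lam, r=1e-3):
    """Gradient of L(theta) + lam * ||grad L(theta)||_2.

    The penalty's gradient is H g / ||g||, where H is the Hessian and g the
    loss gradient. The Hessian-vector product is approximated here with a
    finite difference along the normalized gradient direction.
    """
    g = loss_grad(theta, curv)
    norm = math.sqrt(sum(x * x for x in g))
    if lam == 0.0 or norm == 0.0:
        return g  # no regularization, or the penalty is not differentiable at g = 0
    shifted = [t + r * x / norm for t, x in zip(theta, g)]
    g_shifted = loss_grad(shifted, curv)
    return [x + lam * (xs - x) / r for x, xs in zip(g, g_shifted)]

# Early in warmup the update direction is close to the plain loss gradient;
# the penalty only reaches full strength once warmup has finished.
theta, curv = [1.0, -2.0], [1.0, 3.0]
for step in (0, 50, 100):
    lam = gr_lambda(step, warmup_steps=100, lam_max=0.1)
    print(step, lam, gr_gradient(theta, curv, lam))
```

In a real training loop the resulting gradient would be passed to an adaptive optimizer such as Adam. The intent of the ramp is that the optimizer's moment estimates accumulate from nearly unregularized gradients during the first steps.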
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
