Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks
Huixiu Jiang, Ling Yang, Yu Bao, Rutong Si, Sikun Yang

TL;DR
This paper introduces Adaptive Gradient Regularization (AGR), a novel optimization technique that adaptively controls gradient descent direction to improve training efficiency and generalization in deep neural networks.
Contribution
It is the first to propose using sum normalization of gradients for adaptive regularization, effectively smoothing the loss landscape and enhancing optimizer performance.
Findings
AGR improves training efficiency of AdamW and Adan.
AGR enhances model generalization across tasks.
AGR is simple to implement with minimal code changes.
Abstract
Stochastic optimization plays a crucial role in the advancement of deep learning technologies. Over the decades, significant effort has been dedicated to improving the training efficiency and robustness of deep neural networks through strategies such as gradient normalization (GN) and gradient centralization (GC). Nevertheless, to the best of our knowledge, no prior work has tried to capture the optimal gradient descent trajectory by adaptively controlling the gradient descent direction. To address this gap, this paper makes a first attempt at a new optimization technique for deep neural networks that uses the sum normalization of a gradient vector as coefficients to dynamically regularize gradients and thereby control the optimization direction. The proposed technique is hence named adaptive gradient regularization (AGR). It can be viewed as an adaptive gradient…
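The abstract is truncated before it gives a formula, so the following is only a minimal sketch of one possible reading of "using the sum normalization of a gradient vector as coefficients". It assumes each coefficient is the entry's absolute value divided by the sum of absolute values, and that each gradient entry is shrunk by one minus its coefficient. The function names `agr` and `sgd_step`, the `eps` guard, and the use of plain SGD instead of AdamW or Adan are all illustrative assumptions, not the authors' implementation.

```python
def agr(grad, eps=1e-12):
    """Hypothetical AGR-style helper (assumed form, not taken from the paper).

    alpha_i = |g_i| / sum_j |g_j|   (sum normalization of the gradient)
    returns  (1 - alpha_i) * g_i    for every entry
    """
    total = sum(abs(g) for g in grad)
    if total < eps:
        # All-zero (or negligible) gradient: nothing to regularize.
        return list(grad)
    return [(1.0 - abs(g) / total) * g for g in grad]


def sgd_step(weights, grad, lr=0.1):
    """Plain SGD update on the regularized gradient.

    SGD is swapped in here only to keep the sketch short; the summary
    reports results for AdamW and Adan.
    """
    regularized = agr(grad)
    return [w - lr * g for w, g in zip(weights, regularized)]


if __name__ == "__main__":
    g = [1.0, -1.0, 2.0]
    print(agr(g))
    print(sgd_step([0.0, 0.0, 0.0], g))
```

Under this assumed form, entries with a larger share of the total gradient magnitude are damped more strongly, and the change to an existing training loop is a single call on the gradient before the optimizer update.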
Taxonomy
Topics: Neural Networks and Applications · Face and Expression Recognition
Methods: Adaptive Nesterov Momentum · Weight Normalization · AdamW · Gradient Normalization · Weight Standardization · Gradient Clipping
