Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Huixiu Jiang; Ling Yang; Yu Bao; Rutong Si; Sikun Yang

arXiv:2407.16944·cs.LG·August 21, 2024

Open Access

TL;DR

This paper introduces Adaptive Gradient Regularization (AGR), a novel optimization technique that adaptively controls gradient descent direction to improve training efficiency and generalization in deep neural networks.

Contribution

It is the first to propose using sum normalization of gradients for adaptive regularization, effectively smoothing the loss landscape and enhancing optimizer performance.

Findings

01 AGR improves training efficiency of AdamW and Adan.

02 AGR enhances model generalization across tasks.

03 AGR is simple to implement with minimal code changes.

Abstract

Stochastic optimization plays a crucial role in the advancement of deep learning technologies. Over the decades, significant effort has been dedicated to improving the training efficiency and robustness of deep neural networks through strategies including gradient normalization (GN) and gradient centralization (GC). Nevertheless, to the best of our knowledge, no one has considered capturing the optimal gradient descent trajectory by adaptively controlling the gradient descent direction. To address this gap, this paper is the first attempt to study a new optimization technique for deep neural networks that uses the sum normalization of a gradient vector as coefficients to dynamically regularize gradients and thus effectively control the optimization direction. The proposed technique is hence named adaptive gradient regularization (AGR). It can be viewed as an adaptive gradient…
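As a rough illustration of the idea described in the abstract, the sketch below computes sum-normalized coefficients from a gradient vector and uses them to scale the gradient before an optimizer would consume it. The abstract does not give the exact formula, so the specific form used here (coefficient = |g_i| / sum of |g_j|, regularized gradient = (1 - coefficient) * g_i) is an assumption, and the function names `sum_normalized_coefficients` and `agr_like` are hypothetical.

```python
from typing import List


def sum_normalized_coefficients(grad: List[float], eps: float = 1e-12) -> List[float]:
    """Return |g_i| / sum_j |g_j| for each entry (sum normalization).

    eps guards against division by zero when the gradient is all zeros.
    """
    total = sum(abs(g) for g in grad)
    return [abs(g) / (total + eps) for g in grad]


def agr_like(grad: List[float]) -> List[float]:
    """Scale each gradient entry by (1 - coefficient).

    ASSUMPTION: this particular scaling is an illustrative guess at how the
    coefficients regularize the gradient; it is not taken from the abstract.
    """
    coeffs = sum_normalized_coefficients(grad)
    return [(1.0 - a) * g for a, g in zip(coeffs, grad)]


# Example gradient vector (made-up values for illustration only).
grad = [0.5, -1.5, 2.0]
coeffs = sum_normalized_coefficients(grad)
regularized = agr_like(grad)
print(coeffs)
print(regularized)
```

Under this assumed form, entries with larger magnitude receive larger coefficients and are shrunk proportionally more, while signs are preserved. In practice such a transform would be applied to each parameter's gradient just before the AdamW or Adan update.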

Taxonomy

Topics: Neural Networks and Applications · Face and Expression Recognition

Methods: Adaptive Nesterov Momentum · Weight Normalization · AdamW · Gradient Normalization · Weight Standardization · Gradient Clipping