Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

TL;DR
NovoGrad is an adaptive optimizer for deep learning that normalizes gradients layer-wise and decouples weight decay. It matches or beats well-tuned SGD with momentum, Adam, and AdamW on image classification, speech recognition, machine translation, and language modeling, while using half the memory of Adam.
Contribution
The paper introduces NovoGrad, an adaptive optimizer that combines layer-wise gradient normalization with decoupled weight decay. It is robust to the choice of learning rate and weight initialization and needs half the memory of Adam.
Findings
Performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW.
Robust to the choice of learning rate and weight initialization.
Works well in large-batch training and has half the memory footprint of Adam.
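The memory finding can be illustrated with a rough count of optimizer state. The sketch below assumes that Adam keeps two per-weight buffers (first and second moments) and that NovoGrad keeps one per-weight first moment plus a single scalar second moment per layer, which is one reading of "layer-wise" normalization. The function name and layer sizes are hypothetical and chosen only for illustration.

```python
def optimizer_state_floats(layer_sizes):
    """Count values the optimizer keeps between steps (weights excluded)."""
    n_weights = sum(layer_sizes)
    # Adam (assumed layout): first and second moment, each weight-shaped.
    adam = 2 * n_weights
    # NovoGrad (assumed layout): weight-shaped first moment plus
    # one scalar second moment per layer.
    novograd = n_weights + len(layer_sizes)
    return adam, novograd


# Hypothetical three-layer model; sizes are illustrative only.
layer_sizes = [1_000_000, 250_000, 10_000]
adam_state, novograd_state = optimizer_state_floats(layer_sizes)
ratio = adam_state / novograd_state
print(adam_state, novograd_state, round(ratio, 3))
```

Under these assumptions the ratio approaches 2 as layers grow, since the per-layer scalars are negligible next to the per-weight buffers.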
Abstract
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has half the memory footprint of Adam.
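The abstract does not spell out the update rule, so the following is only a minimal sketch of how layer-wise normalization and decoupled weight decay might fit together in a single step. It assumes a per-layer scalar second moment built from the squared gradient norm, with the weight-decay term added after normalization so that it is not rescaled by the second moment. The function name `novograd_step`, the argument layout, and all default hyperparameter values are illustrative assumptions, not settings confirmed by this page.

```python
import math


def novograd_step(weights, grads, m, v, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.0, eps=1e-8):
    """One in-place update over all layers (a sketch, not a reference version).

    weights, grads, m: one list of floats per layer.
    v: one float per layer, or None before the first step.
    """
    for l, (w, g) in enumerate(zip(weights, grads)):
        g_sq = sum(x * x for x in g)  # squared L2 norm of this layer's gradient
        if v[l] is None:
            v[l] = g_sq  # assumed initialization: first squared norm
        else:
            v[l] = beta2 * v[l] + (1 - beta2) * g_sq
        denom = math.sqrt(v[l]) + eps
        for i in range(len(w)):
            # Normalize by the layer-wise scalar, then add weight decay
            # separately so the decay is not divided by the second moment.
            m[l][i] = beta1 * m[l][i] + (g[i] / denom + weight_decay * w[i])
            w[i] -= lr * m[l][i]


# Single-layer usage example with hypothetical values.
weights = [[3.0, 4.0]]
grads = [[3.0, 4.0]]
m = [[0.0, 0.0]]
v = [None]
novograd_step(weights, grads, m, v, lr=0.1)
print(weights, m, v)
```

In this example the gradient norm is 5, so the normalized gradient has unit norm and the step size depends on the learning rate rather than on the raw gradient scale.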
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: AdamW · SGD with Momentum · Adam · Stochastic Gradient Descent
