Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

Boris Ginsburg; Patrice Castonguay; Oleksii Hrinchuk; Oleksii Kuchaiev; Vitaly Lavrukhin; Ryan Leary; Jason Li; Huyen Nguyen; Yang Zhang; Jonathan M. Cohen

arXiv:1905.11286·cs.LG·February 10, 2020·88 cites

TL;DR

NovoGrad is an adaptive optimizer that normalizes gradients layer-wise and decouples weight decay. On image classification, speech recognition, machine translation, and language modeling it matches or beats well-tuned SGD with momentum, Adam, and AdamW, while using half the memory of Adam.

Contribution

The paper introduces NovoGrad, an adaptive optimizer that combines layer-wise gradient normalization with decoupled weight decay. It is robust to the choice of learning rate and weight initialization and has half the memory footprint of Adam.

Findings

1. Performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW.

2. Robust to the choice of learning rate and weight initialization.

3. Works well in large-batch training and has half the memory footprint of Adam.

Abstract

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par with or better than well-tuned SGD with momentum, Adam, and AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has half the memory footprint of Adam.
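To make "layer-wise gradient normalization" and "decoupled weight decay" concrete, here is a minimal pure-Python sketch of one NovoGrad-style update. The update rule follows the one given in the full paper as best we can reconstruct it; it is not stated on this page, so treat the details as an assumption. The function name `novograd_step`, the list-of-lists layout, and the default hyperparameters are illustrative choices, not the authors' reference implementation.

```python
import math


def novograd_step(weights, grads, state, lr=0.01, beta1=0.95, beta2=0.98,
                  weight_decay=0.0, eps=1e-8):
    """Apply one NovoGrad-style update in place (illustrative sketch).

    weights, grads: list of layers, each layer a flat list of floats.
    state: dict holding "m" (one float per weight) and "v" (one float
           per layer). Pass an empty dict on the first call.
    Hyperparameter defaults are assumptions chosen for illustration.
    """
    first_step = not state
    if first_step:
        state["m"] = [[0.0] * len(w) for w in weights]
        state["v"] = [0.0] * len(weights)

    for l, (w, g) in enumerate(zip(weights, grads)):
        # Second moment is a single scalar per layer: the squared
        # gradient norm, averaged over steps.
        g_norm_sq = sum(x * x for x in g)
        if first_step:
            state["v"][l] = g_norm_sq
        else:
            state["v"][l] = beta2 * state["v"][l] + (1 - beta2) * g_norm_sq

        denom = math.sqrt(state["v"][l]) + eps
        m = state["m"][l]
        for i in range(len(w)):
            # Normalize the gradient by the layer-wise scale, then add
            # weight decay outside the normalization (decoupled).
            m[i] = beta1 * m[i] + (g[i] / denom + weight_decay * w[i])
            w[i] -= lr * m[i]
    return weights
```

In this sketch the optimizer state is one per-weight buffer (`m`) plus one scalar per layer (`v`), whereas Adam keeps two per-weight buffers; that is consistent with the memory claim in the abstract. Because each layer's gradient is divided by its own norm estimate, rescaling all gradients by a constant leaves the first update essentially unchanged.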

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: AdamW · SGD with Momentum · Adam · Stochastic Gradient Descent