AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Byeongho Heo; Sanghyuk Chun; Seong Joon Oh; Dongyoon Han; Sangdoo Yun; Gyuwan Kim; Youngjung Uh; Jung-Woo Ha

arXiv:2006.08217 · cs.LG · January 19, 2021 · 81 cites

TL;DR

This paper shows that combining momentum-based gradient descent optimizers with scale-invariant weights causes effective step sizes to decay prematurely, and proposes AdamP and SGDP to mitigate the issue, improving performance across diverse benchmarks.

Contribution

The paper introduces AdamP and SGDP optimizers that address premature step size decay caused by momentum and scale invariance, enhancing training stability and accuracy.

Findings

01. AdamP and SGDP improve performance on 13 benchmarks.
02. The methods stabilize training by maintaining effective step sizes.
03. Uniform gains are observed across vision, language, and audio tasks.

Abstract

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in the current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper,…
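The mechanism the abstract describes can be illustrated with a small numerical sketch. This is not the paper's experiment: the toy loss f(w) = -(w . target) / ||w||, the starting point, the learning rate, and the helper names (`grad`, `run`) are all assumptions chosen for illustration. Because the loss is scale-invariant, its gradient is orthogonal to w, so every update increases ||w||, and the quantity lr / ||w||^2 (used here as the effective step size on the unit sphere) shrinks. The sketch compares plain GD with heavy-ball momentum.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def grad(w, target):
    """Gradient of the scale-invariant toy loss f(w) = -(w . target) / ||w||.

    The loss and this helper are illustrative assumptions, not from the paper.
    """
    r = norm(w)
    cos = dot(w, target) / r
    # grad f = -(target - cos * w / r) / r, which is orthogonal to w.
    return [-(t - cos * x / r) / r for x, t in zip(w, target)]

def run(beta, steps=200, lr=0.1):
    """Heavy-ball momentum p <- beta * p + g, w <- w - lr * p.

    beta = 0.0 reduces to plain gradient descent.
    Returns the weight norm before the first step and after each step.
    """
    w = [1.0, 0.0]
    target = [0.0, 1.0]
    p = [0.0, 0.0]
    norms = [norm(w)]
    for _ in range(steps):
        g = grad(w, target)
        p = [beta * pi + gi for pi, gi in zip(p, g)]
        w = [wi - lr * pi for wi, pi in zip(w, p)]
        norms.append(norm(w))
    return norms

gd_norms = run(beta=0.0)
momentum_norms = run(beta=0.9)

lr = 0.1
print("final ||w||, plain GD:      ", gd_norms[-1])
print("final ||w||, momentum:      ", momentum_norms[-1])
print("effective step, plain GD:   ", lr / gd_norms[-1] ** 2)
print("effective step, momentum:   ", lr / momentum_norms[-1] ** 2)
```

Under these assumptions, the momentum run ends with a larger weight norm, and therefore a smaller effective step size, than the plain GD run with the same learning rate. Real networks use stochastic gradients and many scale-invariant layers, so this sketch only shows the direction of the effect, not its size.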

Code & Models

Videos

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning

Methods: Stochastic Gradient Descent · Batch Normalization