On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu; Haoming Jiang; Pengcheng He; Weizhu Chen; Xiaodong Liu; Jianfeng Gao; Jiawei Han

arXiv:1908.03265·cs.LG·October 27, 2021·608 cites

TL;DR

This paper analyzes the variance issues in adaptive learning rates, explains how warmup heuristics help, and introduces RAdam, a new optimizer that rectifies variance problems, improving training stability and performance.

Contribution

It provides a theoretical understanding of warmup's effectiveness and proposes RAdam, a variant of Adam that addresses variance issues in early training stages.

Findings

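01 Warmup reduces the variance of the adaptive learning rate in the early stage of training.

02 RAdam outperforms Adam on image classification, language modeling, and neural machine translation.

03 Both theoretical and empirical evidence support variance rectification.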
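To make the warmup heuristic in the first finding concrete, here is a minimal sketch of a linear warmup schedule. It is one common form of warmup, not necessarily the exact schedule used in the paper; the function name `linear_warmup_lr` and the default values for `base_lr` and `warmup_steps` are assumptions chosen for illustration.

```python
def linear_warmup_lr(step, base_lr=1e-3, warmup_steps=1000):
    """Scale the learning rate linearly from 0 to base_lr over warmup_steps.

    Illustrative only: the name and defaults are hypothetical, not taken
    from the paper.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the base learning rate from the start.
        return base_lr
    # Small steps early on limit the damage done by a noisy adaptive
    # learning rate while few gradient samples have been observed.
    return base_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 1000, 5000):
        print(step, linear_warmup_lr(step))
```

After `warmup_steps` updates the schedule stays at `base_lr`, so the heuristic only affects the early stage of training, which is where the paper locates the variance problem.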

Abstract

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms such as RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are…
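The abstract mentions a term that rectifies the variance of the adaptive learning rate but does not state it. As a rough sketch, the rectification term from the full paper can be computed as below. The function name `rectification_term` and the convention of returning `None` are assumptions for illustration; the threshold of 4 follows the paper's algorithm as commonly stated, and some library implementations use a slightly different cutoff.

```python
import math


def rectification_term(step, beta2=0.999, threshold=4.0):
    """Return the RAdam rectification factor r_t, or None when it is undefined.

    Sketch only: step counts from 1, and the None convention is a
    hypothetical choice meaning "skip the adaptive scaling for this step".
    """
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= threshold:
        # Variance of the adaptive learning rate is not tractable yet.
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


if __name__ == "__main__":
    for step in (1, 4, 5, 100, 1000, 100000):
        print(step, rectification_term(step))
```

With `beta2=0.999` the factor is undefined for the first few steps, then grows from a small value toward 1 as training proceeds, so the update is damped in the early stage much as a warmup schedule would damp it.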

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques

Methods: Adam · RAdam · RMSProp