On the Variance of the Adaptive Learning Rate and Beyond
Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han

TL;DR
This paper analyzes the variance issues in adaptive learning rates, explains how warmup heuristics help, and introduces RAdam, a new optimizer that rectifies variance problems, improving training stability and performance.
Contribution
It provides a theoretical understanding of warmup's effectiveness and proposes RAdam, a variant of Adam that addresses variance issues in early training stages.
Findings
Warmup reduces variance in adaptive learning rates.
RAdam outperforms Adam on image classification, language modeling, and neural machine translation.
Both theoretical analysis and experiments support the variance-rectification approach.
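To make the warmup heuristic concrete, here is a minimal sketch of a linear warmup schedule. This summary does not specify the schedule, so the function name `linear_warmup_lr`, the parameter `warmup_steps`, and the choice of a linear ramp are all illustrative assumptions. The idea is to use small learning rates for the first few updates, while the adaptive term has seen few gradients, and then hold the base rate.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate at a 1-indexed step under a linear warmup (illustrative).

    The rate ramps linearly from base_lr / warmup_steps up to base_lr,
    then stays at base_lr. All names here are hypothetical.
    """
    if step < 1:
        raise ValueError("step is 1-indexed and must be >= 1")
    if warmup_steps <= 0:
        # No warmup requested: use the base rate from the first step.
        return base_lr
    return base_lr * min(1.0, step / warmup_steps)
```

For example, with `base_lr=0.001` and `warmup_steps=10`, the first update uses one tenth of the base rate and the tenth update reaches it.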
Abstract
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify a problem of the adaptive learning rate (i.e., it has problematically large variance in the early stage), suggest warmup works as a variance reduction technique, and provide both empirical and theoretical evidence to verify our hypothesis. We further propose RAdam, a new variant of Adam, by introducing a term to rectify the variance of the adaptive learning rate. Extensive experimental results on image classification, language modeling, and neural machine translation verify our intuition and demonstrate the effectiveness and robustness of our proposed method. All implementations are…
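The abstract does not spell out the rectification term, so the sketch below follows the formula as commonly stated for the published RAdam algorithm; treat it as an illustration under that assumption rather than the authors' reference implementation. The function name `rectification_term` is hypothetical. It computes the length of the approximated simple moving average, rho_t, and returns the multiplier applied to the adaptive step once the variance is tractable (rho_t > 4). It returns `None` for early steps, where the update would fall back to a non-adaptive, momentum-style step. Some library implementations use a slightly different threshold.

```python
import math
from typing import Optional


def rectification_term(step: int, beta2: float = 0.999) -> Optional[float]:
    """Variance rectification multiplier r_t for a 1-indexed step (sketch).

    Returns None while rho_t <= 4, i.e. while the variance of the
    adaptive learning rate is not yet tractable.
    Assumes beta2 is close enough to 1 that rho_inf > 4.
    """
    if step < 1:
        raise ValueError("step is 1-indexed and must be >= 1")
    # Maximum length of the approximated simple moving average.
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    # Length of the approximated simple moving average at this step.
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)
```

With the default `beta2`, the term is undefined for the first few steps, then grows from a small value toward 1 as training proceeds, which has an effect similar to a built-in warmup.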
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques
Methods: Adam · RAdam · RMSProp
