An Adaptive and Momental Bound Method for Stochastic Learning
Jianbang Ding, Xuancheng Ren, Ruixuan Luo, Xu Sun

TL;DR
This paper introduces AdaMod, an adaptive learning rate method that stabilizes training by bounding learning rates based on exponential moving averages, improving convergence especially on complex neural networks.
Contribution
The paper proposes AdaMod, a novel adaptive learning rate method that prevents excessively large updates, enhancing stability and performance in deep neural network training.
Findings
AdaMod eliminates extremely large learning rates during training.
AdaMod significantly improves training stability on DenseNet and Transformer models.
Experimental results show AdaMod outperforms Adam in complex network training.
Abstract
Training deep neural networks requires intricate initialization and careful selection of learning rates. The emergence of stochastic gradient optimization methods that use adaptive learning rates based on squared past gradients, e.g., AdaGrad, AdaDelta, and Adam, eases the job slightly. However, recent studies have shown that such methods have pitfalls of their own, including non-convergence issues. Alternative variants have been proposed as improvements, such as AMSGrad, AdaShift, and AdaBound. In this work, we identify a new problem of adaptive learning rate methods that appears at the beginning of training, where Adam produces extremely large learning rates that inhibit the start of learning. We propose the Adaptive and Momental Bound (AdaMod) method to restrict the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate…
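The bounding idea described above can be illustrated with a minimal sketch for a single scalar parameter. It assumes that the upper bound is an exponential moving average of the Adam step sizes and that the bounded step is the minimum of the two; this summary does not state the exact update rule, so the form of the bound, the function names `new_state` and `adamod_step`, the smoothing coefficient `beta3`, and all default values are assumptions made for illustration, not the authors' reference implementation.

```python
import math


def new_state():
    """Fresh optimizer state for one scalar parameter (hypothetical helper)."""
    return {"t": 0, "m": 0.0, "v": 0.0, "s": 0.0, "eta_bounded": 0.0}


def adamod_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                beta3=0.999, eps=1e-8):
    """Apply one bounded update and return the new parameter value."""
    state["t"] += 1
    t = state["t"]

    # Adam-style moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad

    # Bias correction, as in Adam.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)

    # Adaptive step size that Adam would use for this parameter.
    eta = lr / (math.sqrt(v_hat) + eps)

    # Assumed form of the momental bound: a moving average of past step sizes.
    state["s"] = beta3 * state["s"] + (1 - beta3) * eta

    # Clip the step size from above so that a sudden spike cannot pass through.
    state["eta_bounded"] = min(eta, state["s"])

    return param - state["eta_bounded"] * m_hat


if __name__ == "__main__":
    # Toy problem: minimize f(x) = x^2, whose gradient is 2x.
    x, state = 5.0, new_state()
    for _ in range(1000):
        x = adamod_step(x, 2.0 * x, state)
    print(x, state["eta_bounded"])
```

Because the moving average starts at zero in this sketch, the bounded step is very small in the first iterations and grows as the average catches up, which is one way to damp large early learning rates.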
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · AdaShift · AdaMod · AdaBound · Batch Normalization · Residual Connection · Convolution · Average Pooling
