Adaptive Gradient Method with Resilience and Momentum
Jie Liu, Chen Lin, Chuming Li, Lu Sheng, Ming Sun, Junjie Yan, Wanli Ouyang

TL;DR
This paper introduces AdaRem, an adaptive gradient method that improves training speed and generalization in deep neural networks by reducing oscillations through a novel parameter-wise learning rate adjustment based on gradient direction consistency.
Contribution
AdaRem is a new adaptive gradient method that encourages long-term consistent updates, leading to faster convergence and better generalization compared to existing methods.
Findings
AdaRem outperforms previous adaptive methods in training speed.
AdaRem achieves lower test error on ImageNet.
Theoretical convergence of AdaRem is established.
Abstract
Several variants of stochastic gradient descent (SGD) have been proposed to improve the learning effectiveness and efficiency when training deep neural networks, among which some recent influential attempts adaptively control the parameter-wise learning rate (e.g., Adam and RMSProp). Although they show a large improvement in convergence speed, most adaptive learning rate methods suffer from compromised generalization compared with SGD. In this paper, we propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem), motivated by the observation that the oscillations of network parameters slow the training, and give a theoretical proof of convergence. For each parameter, AdaRem adjusts the parameter-wise learning rate according to whether the direction in which that parameter changed in the past is aligned with the direction of the current gradient, and thus encourages…
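The idea in the abstract, raising a parameter's learning rate when the current gradient agrees with that parameter's past direction of change and lowering it when they disagree, can be illustrated with a minimal sketch. This is not the update rule from the paper, which the truncated abstract does not give. The function name `direction_aware_step`, the sign-based agreement measure, and the hyperparameters `base_lr`, `beta`, and `scale` are all assumptions made for illustration.

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def direction_aware_step(params, grads, momentum,
                         base_lr=0.1, beta=0.9, scale=0.5):
    """One illustrative update step (hypothetical rule, NOT AdaRem's formula).

    `momentum` holds a running average of past gradients for each parameter,
    so the past update direction of a parameter is -momentum. The step taken
    now has direction -grad, so the two are aligned when grad and momentum
    share a sign.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # +1 if aligned with the past direction, -1 if opposed, 0 if no history.
        agreement = sign(g) * sign(m)
        # Assumed scaling: larger step when aligned, smaller when oscillating.
        lr = base_lr * (1.0 + scale * agreement)
        new_params.append(p - lr * g)
        # Exponential moving average of gradients (assumed form of the history).
        new_momentum.append(beta * m + (1.0 - beta) * g)
    return new_params, new_momentum
```

With `scale` between 0 and 1, the per-parameter learning rate stays positive. A parameter whose gradient keeps flipping sign takes smaller steps, which is the oscillation-damping behavior the abstract describes.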
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Stochastic Gradient Descent · Adam
