AdaLoss: A computationally-efficient and provably convergent adaptive gradient method
Xiaoxia Wu, Yuege Xie, Simon Du, Rachel Ward

TL;DR
AdaLoss is an adaptive gradient method that uses the value of the loss function to adjust the stepsize. It achieves provable linear convergence in linear regression and in two-layer over-parameterized neural networks, with practical applications demonstrated on NLP and control tasks.
Contribution
It introduces AdaLoss, a novel adaptive learning rate schedule with theoretical convergence guarantees for both convex and non-convex models, including neural networks.
Findings
Linear convergence in linear regression.
Global convergence in over-parameterized neural networks.
Effective in practical NLP and control applications.
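For readers unfamiliar with the term, "linear convergence" in the first finding can be stated as follows. This is the standard textbook definition rather than the paper's exact theorem; the symbols $w_t$, $w^\ast$, $C$, and $\rho$ are notation assumed here for illustration.

```latex
% Linear (geometric) convergence of iterates w_t to a minimizer w^*:
% there exist constants C > 0 and 0 < \rho < 1 such that, for all t >= 0,
\[
  \| w_t - w^\ast \| \;\le\; C \, \rho^{t} .
\]
% Equivalently, reaching accuracy \varepsilon requires
\[
  t \;\ge\; \frac{\log(C/\varepsilon)}{\log(1/\rho)} \;=\; O\!\left(\log \tfrac{1}{\varepsilon}\right)
\]
% iterations.
```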
Abstract
We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence in linear regression. Moreover, we provide a linear convergence guarantee over the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width of the first hidden layer in the two-layer networks is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.
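The abstract describes a stepsize driven by the loss value rather than by the gradient. A minimal sketch of that idea for least squares is given below. The specific update used here (accumulate the squared residual norm in a scalar `b_sq`, then step with `eta / sqrt(b_sq)`) is an assumption made for illustration and may differ from the exact rule in the paper. The function name, the parameters `eta`, `b0`, and `steps`, and the toy data are all hypothetical.

```python
def adaloss_least_squares(X, y, eta=1.0, b0=0.1, steps=50):
    """Gradient descent on F(w) = 0.5 * ||Xw - y||^2 with a loss-driven stepsize.

    Assumed update (illustrative only, not quoted from the paper):
        b_{t+1}^2 = b_t^2 + ||X w_t - y||^2
        w_{t+1}   = w_t - (eta / b_{t+1}) * grad F(w_t)
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b_sq = b0 * b0          # scalar accumulator; b0 > 0 avoids division by zero
    losses = []
    for _ in range(steps):
        # Residual r = Xw - y and the squared residual norm ||r||^2.
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        sq_residual = sum(r * r for r in residual)
        losses.append(0.5 * sq_residual)
        # Gradient of F is X^T r.
        grad = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(d)]
        # Loss information (not the gradient norm) drives the stepsize.
        b_sq += sq_residual
        step = eta / b_sq ** 0.5
        w = [w[j] - step * grad[j] for j in range(d)]
    return w, losses


# Toy, noise-free problem (hypothetical data): y = X @ w_true.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_true = [2.0, -1.0]
y = [sum(x_ij * w_j for x_ij, w_j in zip(row, w_true)) for row in X]

w_hat, losses = adaloss_least_squares(X, y)
print("estimate:", w_hat)
print("first loss:", losses[0], "last loss:", losses[-1])
```

In this sketch the only state beyond plain gradient descent is the scalar `b_sq`, and the residual norm is already needed to form the gradient, so the schedule adds essentially no extra cost per step.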
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
