AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

Xiaoxia Wu; Yuege Xie; Simon Du; Rachel Ward

arXiv:2109.08282·stat.ML·September 20, 2021

TL;DR

AdaLoss is an adaptive gradient method that adjusts the stepsize using the value of the loss function. It provably converges linearly for linear regression and for two-layer over-parameterized neural networks, and is evaluated empirically on NLP and control tasks.

Contribution

The paper introduces AdaLoss, an adaptive learning rate schedule with convergence guarantees in both a convex setting (linear regression) and a non-convex setting (two-layer over-parameterized neural networks).

Findings

01 Linear convergence in linear regression.

02 Global convergence in over-parameterized neural networks.

03 Effective in practical NLP and control applications.

Abstract

We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which directly uses the information of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence in linear regression. Moreover, we provide a linear convergence guarantee over the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width of the first hidden layer in the two-layer networks is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.
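The idea of a loss-driven stepsize in the linear regression setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the stepsize is `eta / b`, where `b**2` accumulates the squared residual norm at each iteration. The abstract does not give the exact update rule or its constants, so the accumulator form, the function name `adaloss_linear_regression`, and the parameters `eta` and `b0` are all hypothetical.

```python
import math


def adaloss_linear_regression(X, y, eta=1.0, b0=0.1, steps=500):
    """Gradient descent on 0.5 * ||Xw - y||^2 with a loss-driven stepsize.

    Assumption (not taken from the abstract): the stepsize is eta / b,
    where b^2 accumulates the squared residual norm seen so far.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b_sq = b0 * b0
    losses = []
    for _ in range(steps):
        # Residual r = Xw - y and squared residual norm ||r||^2.
        residual = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        loss = sum(r * r for r in residual)
        losses.append(loss)
        # Gradient of 0.5 * ||Xw - y||^2 is X^T r.
        grad = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(d)]
        # Grow the accumulator with the loss (not the gradient norm),
        # then take a step of size eta / b.
        b_sq += loss
        step = eta / math.sqrt(b_sq)
        w = [w[j] - step * grad[j] for j in range(d)]
    return w, losses


if __name__ == "__main__":
    # Toy problem whose exact solution is w = [2, -1].
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [2.0, -1.0, 1.0]
    w, losses = adaloss_linear_regression(X, y)
    print("estimated weights:", w)
    print("first and last loss:", losses[0], losses[-1])
```

Because the accumulator only needs the loss value already computed in the forward pass, this variant avoids computing a gradient norm for the stepsize, which is one plausible reading of "computationally-friendly" in the abstract.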

Taxonomy

Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory