Equilibrated adaptive learning rates for non-convex optimization

Yann N. Dauphin; Harm de Vries; Yoshua Bengio

arXiv:1502.04390 · cs.LG · September 1, 2015 · 152 citations

TL;DR

This paper introduces ESGD, an adaptive learning rate method based on the equilibration preconditioner, which the authors argue is better suited than the Jacobi preconditioner to the mixed positive and negative curvature found near saddle points when training deep networks.

Contribution

The paper proposes ESGD, a novel adaptive learning rate scheme that uses the equilibration preconditioner to reduce ill-conditioning in non-convex optimization.
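
This page does not spell out the ESGD update rule, so the following is only a minimal sketch under stated assumptions. It assumes the preconditioner is a running average of squared Hessian-vector products (Hv)^2 with Gaussian probe vectors v, and that the gradient is divided by its square root plus a small damping term. The function name `esgd_quadratic`, the toy quadratic objective, and all hyperparameter values are hypothetical choices made for illustration.

```python
import numpy as np

def esgd_quadratic(H, b, theta0, lr=0.1, damping=1e-4, steps=300, seed=0):
    """ESGD-style updates on f(theta) = 0.5 * theta^T H theta - b^T theta.

    Assumed update (not taken from this page):
        D_k   = D_{k-1} + (H v)^2,  v ~ N(0, I)
        theta = theta - lr * grad / (sqrt(D_k / k) + damping)
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    d_acc = np.zeros_like(theta)  # running sum of squared Hessian-vector products
    for k in range(1, steps + 1):
        grad = H @ theta - b                  # exact gradient of the quadratic
        v = rng.standard_normal(theta.shape)  # Gaussian probe vector
        hv = H @ v                            # Hessian-vector product (H is explicit here)
        d_acc += hv ** 2
        precond = np.sqrt(d_acc / k) + damping  # per-parameter scaling
        theta -= lr * grad / precond
    return theta

# Toy ill-conditioned problem whose minimizer is theta = (1, 1).
H = np.array([[100.0, 1.0], [1.0, 1.0]])
b = H @ np.array([1.0, 1.0])
theta = esgd_quadratic(H, b, theta0=np.zeros(2))
print(theta)
```

In a real network the Hessian is never formed explicitly; the product Hv would come from automatic differentiation, which this toy example sidesteps by using a fixed matrix.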

Findings

1. ESGD performs as well as or better than RMSProp in terms of convergence speed.
2. The equilibration preconditioner is better suited than the Jacobi preconditioner to non-convex settings.
3. ESGD consistently improves over plain stochastic gradient descent.
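
To make the Jacobi versus equilibration comparison concrete, here is a small sketch on a hand-picked indefinite matrix. It assumes the usual definitions, which this page does not state: the Jacobi preconditioner is the absolute value of the Hessian diagonal, and the equilibration preconditioner is the vector of Hessian row norms, which can be estimated from Gaussian probes as sqrt(E[(Hv)^2]). The function names and the example matrix are hypothetical.

```python
import numpy as np

def jacobi_preconditioner(H):
    """Assumed definition: |diag(H)|."""
    return np.abs(np.diag(H))

def equilibration_preconditioner(H):
    """Assumed definition: Euclidean norm of each row of H."""
    return np.linalg.norm(H, axis=1)

def estimate_equilibration(H, n_samples=20000, seed=0):
    """Monte Carlo estimate sqrt(mean((H v)^2)) with v ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_samples, H.shape[0]))
    HV = V @ H.T  # each row is (H v)^T for one probe vector v
    return np.sqrt(np.mean(HV ** 2, axis=0))

# Indefinite matrix (one positive, one negative eigenvalue) whose diagonal
# is tiny compared with its off-diagonal entries.
H = np.array([[0.02, 1.0], [1.0, -0.02]])

d_jacobi = jacobi_preconditioner(H)
d_equil = equilibration_preconditioner(H)
d_est = estimate_equilibration(H)

# Row norms of the preconditioned matrix D^{-1} H.
jacobi_rows = np.linalg.norm(H / d_jacobi[:, None], axis=1)
equil_rows = np.linalg.norm(H / d_equil[:, None], axis=1)

print("Jacobi-scaled row norms:      ", jacobi_rows)
print("Equilibrated row norms:       ", equil_rows)
print("Exact vs estimated row norms: ", d_equil, d_est)
```

In this example the diagonal entries are small while the rows are not, so dividing by the Jacobi preconditioner inflates each row by a large factor, whereas equilibration scales every row to unit norm.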

Abstract

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we find how considering the presence of negative eigenvalues of the Hessian could help us design better suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well or better…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM