Preconditioned Stochastic Gradient Descent

Xi-Lin Li

arXiv:1512.04202 · stat.ML · February 23, 2017

TL;DR

This paper introduces a novel preconditioning method for stochastic gradient descent that accelerates convergence, handles both convex and non-convex problems, and reduces the need for tuning, demonstrated on deep neural networks.

Contribution

A new preconditioner estimation technique for SGD that improves convergence speed and stability without complex tuning or problem-specific adjustments.

Findings

01  Preconditioned SGD converges faster on deep neural networks.

02  The method effectively dampens gradient noise in stochastic settings.

03  Experimental results show significant efficiency gains in training complex models.

Abstract

Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in methods significantly more complicated than SGD. This paper proposes a new method to estimate a preconditioner such that the amplitudes of the perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to Newton's method for deterministic optimization. Unlike preconditioners based on secant equation fitting, as done in deterministic quasi-Newton methods, which assume a positive definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and non-convex optimizations…
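The amplitude-matching idea in the abstract can be illustrated with a small numerical sketch. It rests on assumptions that this page does not state: the matching condition is read as P E[dg dg^T] P = E[dθ dθ^T] for a symmetric positive definite preconditioner P, the objective is a noise-free quadratic so that gradient perturbations are dg = H dθ, and P is obtained in closed form rather than by the paper's own estimator. The names `spd_power` and `matching_preconditioner` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def spd_power(M, p):
    """Raise a symmetric positive definite matrix to the power p via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T


def matching_preconditioner(dtheta, dgrad):
    """Closed-form SPD solution of P A P = B (illustrative assumption, not the paper's estimator).

    A is the sample second moment of gradient perturbations,
    B is the sample second moment of parameter perturbations.
    Rows of dtheta and dgrad are paired samples.
    """
    n = dtheta.shape[0]
    A = dgrad.T @ dgrad / n
    B = dtheta.T @ dtheta / n
    A_half = spd_power(A, 0.5)
    A_inv_half = spd_power(A, -0.5)
    # P = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2} satisfies P A P = B.
    S = spd_power(A_half @ B @ A_half, 0.5)
    P = A_inv_half @ S @ A_inv_half
    return P, A, B


rng = np.random.default_rng(0)

# An indefinite (non-convex) Hessian: one eigenvalue is negative.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
H = Q @ np.diag([4.0, -1.0, 0.25]) @ Q.T

# Small random parameter perturbations and the gradient perturbations they induce.
dtheta = rng.standard_normal((2000, 3))
dgrad = dtheta @ H  # each row is H @ dtheta_i because H is symmetric

P, A, B = matching_preconditioner(dtheta, dgrad)

# Second moments of the preconditioned gradient perturbations vs. parameter perturbations.
pg = dgrad @ P
print(np.round(pg.T @ pg / len(pg), 3))
print(np.round(B, 3))
```

Under these assumptions P stays positive definite even though H is not, and when the perturbation covariance is close to the identity P should land close to the inverse of H with its eigenvalues replaced by their absolute values. That is one way to read the abstract's claim that the preconditioner behaves comparably to Newton's method without requiring a positive definite Hessian.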

Taxonomy

Methods: Stochastic Gradient Descent