TL;DR
This paper introduces a novel preconditioning method for stochastic gradient descent that accelerates convergence, handles both convex and non-convex problems, and reduces the need for tuning, demonstrated on deep neural networks.
Contribution
A new preconditioner estimation technique for SGD that improves convergence speed and stability without complex tuning or problem-specific adjustments.
Findings
Preconditioned SGD converges faster on deep neural networks.
The method effectively dampens gradient noise in stochastic settings.
Experimental results show significant efficiency gains in training complex models.
Abstract
Stochastic gradient descent (SGD) is still the workhorse for many practical problems. However, it converges slowly and can be difficult to tune. It is possible to precondition SGD to accelerate its convergence remarkably, but many attempts in this direction either aim at solving specialized problems or result in significantly more complicated methods than SGD. This paper proposes a new method to estimate a preconditioner such that the amplitudes of perturbations of the preconditioned stochastic gradient match those of the perturbations of the parameters to be optimized, in a way comparable to Newton's method for deterministic optimization. Unlike preconditioners based on secant equation fitting, as done in deterministic quasi-Newton methods, which assume a positive definite Hessian and approximate its inverse, the new preconditioner works equally well for both convex and non-convex optimization…
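The perturbation-matching idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's estimation procedure: it assumes a hypothetical two-dimensional quadratic whose Hessian `H` is known, and simply uses `P = inv(H)` (the Newton case the abstract compares against) to show that the preconditioned gradient perturbation `P @ dg` equals the parameter perturbation `dtheta`. It then compares plain and preconditioned gradient descent without gradient noise, for clarity. All names (`H`, `P`, `grad`, `run`) and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ill-conditioned quadratic: f(theta) = 0.5 * theta^T H theta
H = np.diag([100.0, 1.0])

def grad(theta):
    """Exact gradient of the toy quadratic."""
    return H @ theta

# Assumption: the Hessian is known, so we take the Newton preconditioner
# directly. The paper instead estimates a preconditioner from perturbations.
P = np.linalg.inv(H)

# Perturbation matching: for a quadratic, dg = H @ dtheta, so P @ dg == dtheta.
theta = rng.normal(size=2)
dtheta = 1e-3 * rng.normal(size=2)
dg = grad(theta + dtheta) - grad(theta)
matched = np.allclose(P @ dg, dtheta)

def run(precond, lr, steps=50):
    """Run (preconditioned) gradient descent and return the final distance to the optimum."""
    th = np.array([1.0, 1.0])
    for _ in range(steps):
        th = th - lr * (precond @ grad(th))
    return np.linalg.norm(th)

plain = run(np.eye(2), lr=0.01)   # step size limited by the largest curvature (100)
precond = run(P, lr=0.5)          # same contraction rate along every direction

print("perturbations matched:", matched)
print("plain GD distance:", plain)
print("preconditioned distance:", precond)
```

With the preconditioner, both coordinates shrink at the same rate, so a single step size suits every direction. Plain gradient descent must keep its step size small enough for the steepest direction and therefore crawls along the flat one.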
Taxonomy
Methods: Stochastic Gradient Descent
