Effects of momentum scaling for SGD

Dmitry A. Pasechnyuk; Alexander Gasnikov; Martin Takáč

arXiv:2210.11869·math.OC·October 24, 2022

TL;DR

This paper analyzes how momentum scaling in stochastic gradient descent with preconditioning affects convergence, revealing that proper scaling can eliminate dependence on the Lipschitz constant and proposing adaptive formulas for parameters.

Contribution

It provides a convergence analysis of momentum-scaled preconditioned SGD and introduces explicit adaptive formulas for momentum coefficient and step size.

Findings

01

Scaling removes dependence on Lipschitz constant in convergence rates

02

Proper choice of momentum coefficient $\beta$ is crucial for efficiency

03

Proposed adaptive formulas improve parameter tuning

Abstract

The paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum-updated preconditioners with momentum coefficient $\beta$. Seeking to explain the practical efficiency of scaled methods, we provide a convergence analysis in a norm associated with the preconditioner, and demonstrate that scaling allows one to get rid of the gradient's Lipschitz constant in convergence rates. Along the way, we emphasize the important role of $\beta$, which various authors undeservedly and arbitrarily set to a constant of the form $0.99\ldots9$. Finally, we propose explicit constructive formulas for adaptive $\beta$ and step size values.
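
To make the setting concrete, here is a minimal sketch of SGD with a momentum-updated diagonal preconditioner. It assumes the familiar RMSProp-style rule (a running average of squared gradients with coefficient `beta`), which is a stand-in and not necessarily the exact preconditioner analyzed in the paper. The names `scaled_sgd`, `noisy_grad`, `objective`, and the quadratic test problem are hypothetical, and `beta` and `step` are fixed constants here; the paper's adaptive formulas are not reproduced.

```python
import math
import random


def scaled_sgd(grad, x0, step=0.01, beta=0.99, eps=1e-8, iters=2000):
    """SGD scaled by a diagonal preconditioner with momentum coefficient beta.

    Assumed update (RMSProp-style, not taken from the paper):
        d2 <- beta * d2 + (1 - beta) * g**2
        x  <- x - step * g / (sqrt(d2) + eps)
    """
    x = list(x0)
    d2 = [0.0] * len(x)  # per-coordinate running average of squared gradients
    for _ in range(iters):
        g = grad(x)
        for i, gi in enumerate(g):
            d2[i] = beta * d2[i] + (1.0 - beta) * gi * gi
            x[i] -= step * gi / (math.sqrt(d2[i]) + eps)
    return x


# Hypothetical test problem: f(x) = 0.5 * (100 * x[0]**2 + x[1]**2),
# whose curvature differs by a factor of 100 between coordinates.
CURVATURE = [100.0, 1.0]
_rng = random.Random(0)  # fixed seed so the run is repeatable


def objective(x):
    return 0.5 * sum(c * xi * xi for c, xi in zip(CURVATURE, x))


def noisy_grad(x, sigma=0.01):
    # Exact gradient plus Gaussian noise, standing in for a stochastic gradient.
    return [c * xi + _rng.gauss(0.0, sigma) for c, xi in zip(CURVATURE, x)]


x_final = scaled_sgd(noisy_grad, [1.0, 1.0])
print(objective([1.0, 1.0]), objective(x_final))
```

Because each coordinate's gradient is divided by its own running magnitude, the per-coordinate move is on the order of `step` regardless of curvature. This is only an informal illustration of why scaling can make the step size less sensitive to the gradient's Lipschitz constant; it is not a substitute for the paper's analysis.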

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Matrix Theory and Algorithms