Effects of momentum scaling for SGD
Dmitry A. Pasechnyuk, Alexander Gasnikov, Martin Takáč

TL;DR
This paper analyzes how momentum scaling in preconditioned stochastic gradient descent affects convergence. It shows that proper scaling can eliminate the dependence on the gradient's Lipschitz constant, and it proposes adaptive formulas for the momentum coefficient and step size.
Contribution
It provides a convergence analysis of momentum-scaled preconditioned SGD and introduces explicit adaptive formulas for momentum coefficient and step size.
Findings
Scaling removes the dependence on the gradient's Lipschitz constant in convergence rates
Proper choice of the momentum coefficient $\beta$ is crucial for efficiency
Proposed adaptive formulas improve parameter tuning
Abstract
The paper studies the properties of stochastic gradient methods with preconditioning. We focus on momentum-updated preconditioners with momentum coefficient $\beta$. Seeking to explain the practical efficiency of scaled methods, we provide a convergence analysis in a norm associated with the preconditioner, and demonstrate that scaling allows one to get rid of the gradient's Lipschitz constant in convergence rates. Along the way, we emphasize the important role of $\beta$, which various authors arbitrarily and undeservedly set to a constant. Finally, we propose explicit constructive formulas for adaptive $\beta$ and step size values.
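To make the object of study concrete, here is a minimal sketch of SGD with a momentum-updated diagonal preconditioner. The abstract does not state the update rule, so the RMSProp-style form below (an exponential moving average of squared gradients with coefficient `beta`) is an assumption, and the names `scaled_sgd`, `grad_fn`, `step_size`, and `eps` are hypothetical. The sketch keeps `beta` and the step size constant; it does not reproduce the adaptive formulas the paper proposes.

```python
import numpy as np

def scaled_sgd(grad_fn, x0, step_size=0.05, beta=0.9, eps=1e-8,
               n_steps=1000, seed=0):
    """Preconditioned SGD with a momentum-updated diagonal scaling.

    Assumed (RMSProp-style) form, not taken from the paper:
        d_sq <- beta * d_sq + (1 - beta) * g**2
        x    <- x - step_size * g / (sqrt(d_sq) + eps)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d_sq = np.zeros_like(x)  # running estimate of squared gradients
    for _ in range(n_steps):
        g = grad_fn(x, rng)  # (possibly stochastic) gradient at x
        d_sq = beta * d_sq + (1.0 - beta) * g**2  # momentum update
        x = x - step_size * g / (np.sqrt(d_sq) + eps)  # scaled step
    return x

# Badly conditioned quadratic f(x) = 0.5 * sum(a * x**2), exact gradients.
a = np.array([1.0, 100.0])
x_final = scaled_sgd(lambda x, rng: a * x, x0=[5.0, 5.0])
print(x_final)
```

In this sketch the step taken along each coordinate depends on the gradient only through its ratio to the running scale, so the two coordinates of the quadratic progress at the same rate even though their curvatures differ by a factor of 100. This is one informal way to see why a scaled method can be insensitive to the gradient's Lipschitz constant.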
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Matrix Theory and Algorithms
