Special Properties of Gradient Descent with Large Learning Rates
Amirkeivan Mohtashami, Martin Jaggi, Sebastian Stich

TL;DR
This paper investigates the role of large learning rates in gradient descent. It shows that the benefit of a large step size cannot be explained by stochastic noise alone, and that on certain non-convex function classes a large step size can lead GD to a global minimum.
Contribution
It gives a formal proof that, on certain non-convex function classes, GD with a large step size follows a different trajectory than GD with a small step size, and that this different trajectory can end at a global minimum.
Findings
On certain non-convex function classes, large learning rates can lead to convergence to a global minimum.
Stochastic noise is not the sole factor in SGD success.
Large step size effects are also observed in full-batch GD.
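
The trajectory effect described in these findings can be illustrated with a small full-batch GD experiment. The sketch below is a toy example and not the paper's construction: the objective (the pointwise minimum of two quadratics), the constants, the starting point, and the two step sizes are all assumptions chosen for illustration. The narrow basin has curvature 20, so plain GD is unstable there for any step size above 2/20 = 0.1, while the wide basin (curvature 1) remains stable up to a step size of 2.

```python
# Toy 1D objective (hypothetical, for illustration only): the pointwise
# minimum of a narrow quadratic and a wide quadratic.
SHARP_CENTER, SHARP_CURV, SHARP_VALUE = 1.0, 20.0, 0.5  # narrow local minimum
WIDE_CENTER, WIDE_CURV = -2.0, 1.0                       # wide global minimum, value 0


def sharp(x):
    return 0.5 * SHARP_CURV * (x - SHARP_CENTER) ** 2 + SHARP_VALUE


def wide(x):
    return 0.5 * WIDE_CURV * (x - WIDE_CENTER) ** 2


def f(x):
    return min(sharp(x), wide(x))


def grad(x):
    # Gradient of whichever quadratic is active (lower) at x.
    if sharp(x) < wide(x):
        return SHARP_CURV * (x - SHARP_CENTER)
    return WIDE_CURV * (x - WIDE_CENTER)


def gradient_descent(x0, lr, steps=200):
    # Deterministic (noise-free) gradient descent with a constant step size.
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


x0 = 1.5  # starts inside the narrow basin
x_small = gradient_descent(x0, lr=0.01)  # below 2 / SHARP_CURV: stays in the narrow basin
x_large = gradient_descent(x0, lr=0.15)  # above 2 / SHARP_CURV: overshoots out of it

print(f"small step: x = {x_small:.4f}, f(x) = {f(x_small):.4f}")
print(f"large step: x = {x_large:.4f}, f(x) = {f(x_large):.4f}")
```

From the same starting point and with no noise, the small step size settles in the narrow local minimum near x = 1, while the large step size is thrown out of that basin on its first step and then converges to the wide global minimum near x = -2. Only the step size differs between the two runs.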
Abstract
When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining best performance. We demonstrate the same effects also in the noise-less case, i.e. for full-batch GD. We formally prove that GD with large step size -- on certain non-convex function classes -- follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms · Machine Learning and ELM
Methods: Stochastic Gradient Descent
