Tackling benign nonconvexity with smoothing and stochastic gradients
Harsh Vardhan, Sebastian U. Stich

TL;DR
The paper shows that perturbed stochastic gradient descent (SGD) can reach a global minimum, or a neighborhood of one, on certain non-convex functions that are close to a convex-like function, escaping local minima where noiseless gradient descent gets stuck. This offers a partial explanation for SGD's empirical success.
Contribution
The paper identifies a broad class of non-convex functions, those relatively close to a convex-like function, on which perturbed SGD provably converges to a global minimum or a neighborhood of one. Standard analyses, by contrast, only show convergence to stationary points.
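To make the link between perturbation and smoothing concrete, here is one common way to write it down. The notation is an assumption for illustration and does not appear on this page: $U$ is a perturbation distribution, $\gamma$ a step size, and $f_U$ the smoothed version of $f$.

```latex
f_U(x) := \mathbb{E}_{u \sim U}\big[ f(x + u) \big],
\qquad
x_{t+1} = x_t - \gamma\, g_t,
\quad g_t := \nabla f(x_t + u_t),
\quad u_t \sim U,
\qquad
\mathbb{E}_{u_t}\big[ g_t \big] = \nabla f_U(x_t).
```

The last identity holds whenever differentiation and expectation can be interchanged. Under that reading, a gradient step evaluated at a perturbed point is an unbiased stochastic gradient step on the smoothed function $f_U$, which can be much closer to convex than $f$ itself.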
Findings
Perturbed SGD converges to a global minimum, or a neighborhood of one, on certain non-convex functions.
SGD can achieve linear convergence on non-convex functions that are relatively close to a convex-like function.
Gradient descent without noise can get stuck in local minima far from a global solution, whereas perturbed SGD can escape them.
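The contrast in these findings can be illustrated with a toy experiment. The sketch below is not taken from the paper: the test function (a quadratic plus a high-frequency cosine ripple), the constants `A`, `W`, `sigma`, the step size, and the function names are all hypothetical choices made for illustration.

```python
import math
import random

# Hypothetical toy objective: a convex quadratic plus a small high-frequency
# ripple. The ripple creates many local minima; the global minimum is at x = 0.
A, W = 0.2, 20.0  # ripple amplitude and frequency (illustrative values)


def f(x):
    return 0.5 * x * x + A * (1.0 - math.cos(W * x))


def grad_f(x):
    return x + A * W * math.sin(W * x)


def gradient_descent(x0, step, n_steps):
    """Plain gradient descent with no noise."""
    x = x0
    for _ in range(n_steps):
        x -= step * grad_f(x)
    return x


def perturbed_gradient_descent(x0, step, n_steps, sigma, seed=0):
    """Gradient descent where each gradient is evaluated at a point
    perturbed by Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        u = rng.gauss(0.0, sigma)
        x -= step * grad_f(x + u)
    return x


x_gd = gradient_descent(x0=3.0, step=0.01, n_steps=2000)
x_pgd = perturbed_gradient_descent(x0=3.0, step=0.01, n_steps=2000, sigma=0.3)

print(f"gradient descent:           x = {x_gd:.3f}, f(x) = {f(x_gd):.3f}")
print(f"perturbed gradient descent: x = {x_pgd:.3f}, f(x) = {f(x_pgd):.3f}")
```

In this toy setting, plain gradient descent settles in a ripple-induced local minimum close to its starting point, while the perturbed variant drifts into a neighborhood of the global minimum at zero. The reason is that Gaussian smoothing damps the ripple: the expectation of cos(W(x + u)) is exp(-W²σ²/2)·cos(Wx), which is negligible for the values used here, so the smoothed objective is essentially the convex quadratic. The perturbed iterate keeps fluctuating around the minimizer rather than converging exactly, consistent with convergence to a neighborhood.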
Abstract
Non-convex optimization problems are ubiquitous in machine learning, especially in Deep Learning. While such complex problems can often be successfully optimized in practice by using stochastic gradient descent (SGD), theoretical analysis cannot adequately explain this success. In particular, the standard analyses do not show global convergence of SGD on non-convex functions, and instead show convergence to stationary points (which can also be local minima or saddle points). We identify a broad class of nonconvex functions for which we can show that perturbed SGD (gradient descent perturbed by stochastic noise -- covering SGD as a special case) converges to a global minimum (or a neighborhood thereof), in contrast to gradient descent without noise that can get stuck in local minima far from a global solution. For example, on non-convex functions that are relatively close to a…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods
Methods: Stochastic Gradient Descent
