Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes
Davide Barbieri, Matteo Bonforte, Peio Ibarrondo

TL;DR
This paper models stochastic gradient descent (SGD) in machine learning using PDEs, revealing how it concentrates around local minima and escapes suboptimal points, especially in non-convex and degenerate cases.
Contribution
It provides a PDE-based analysis of SGD's dynamics in non-convex settings, including new bounds on escape times and convergence behavior under degenerate diffusion.
Findings
SGD concentrates near local minima in the drift regime.
Stochastic fluctuations enable escape from suboptimal minima.
The paper derives new bounds on the mean exit time in non-convex, degenerate cases.
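The first two findings can be illustrated with a small simulation. The sketch below is not the paper's model: it assumes a hypothetical one-dimensional double-well loss, a constant isotropic noise level (so the diffusion is non-degenerate, unlike the degenerate case the paper treats), and an Euler-Maruyama discretization. The names `grad_loss`, `simulate`, and all parameter values are illustrative choices.

```python
import numpy as np

def grad_loss(x):
    # Gradient of the hypothetical double-well loss f(x) = (x**2 - 1)**2 / 4,
    # which has minima at x = -1 and x = +1 and a barrier at x = 0.
    return x * (x**2 - 1.0)

def simulate(x0, noise, step=0.01, n_steps=2000, n_paths=500, seed=0):
    """Euler-Maruyama paths of dX = -f'(X) dt + noise * dB (assumed toy model).

    Returns the final positions and a boolean mask marking the paths that
    left the right-hand well (crossed x = 0) at some step.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0, dtype=float)
    crossed = np.zeros(n_paths, dtype=bool)
    for _ in range(n_steps):
        kick = noise * np.sqrt(step) * rng.standard_normal(n_paths)
        x = x - step * grad_loss(x) + kick
        crossed |= x < 0.0
    return x, crossed

# Small noise: the drift dominates and paths settle near the minimum at x = 1.
x_small, crossed_small = simulate(x0=0.5, noise=0.05)
# Larger noise: fluctuations carry a sizeable share of paths over the barrier.
x_large, crossed_large = simulate(x0=0.5, noise=0.7)

print("small noise: mean final position", x_small.mean())
print("small noise: fraction that left the well", crossed_small.mean())
print("large noise: fraction that left the well", crossed_large.mean())
```

With small noise the paths concentrate around the nearest minimum, which mirrors the drift regime. With larger noise a substantial fraction of paths cross the barrier within the same time horizon, which is the escape behaviour that a mean exit time quantifies.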
Abstract
In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via the minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even if Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes: in the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this…
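For orientation, the link between SGD and Fokker-Planck equations mentioned in the abstract is usually written in the following standard form. This is a sketch under assumed notation, not taken from the paper: $f$ is the loss, $\eta$ the learning rate, $\Sigma$ the covariance matrix of the gradient noise, $B_t$ a Brownian motion, and $\rho$ the density of the weights $W_t$. The paper's own normalization and scaling may differ.

```latex
% Continuous-time approximation of SGD (Ito SDE), assumed notation:
dW_t = -\nabla f(W_t)\,dt + \sqrt{\eta\,\Sigma(W_t)}\;dB_t

% Associated Fokker-Planck equation for the density \rho(t, w) of W_t:
\partial_t \rho
  = \nabla \cdot \bigl(\rho\,\nabla f\bigr)
  + \frac{\eta}{2} \sum_{i,j} \partial_{w_i}\partial_{w_j}\bigl(\Sigma_{ij}\,\rho\bigr)
```

In this form, the two difficulties named in the abstract correspond to $f$ being non-convex (the drift term) and $\Sigma$ failing to be positive definite at some points (the diffusion term).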
Taxonomy
Topics: Stochastic Gradient Optimization Techniques
Methods: Stochastic Gradient Descent · Diffusion
