Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning
Yao Zhang, Andrew M. Saxe, Madhu S. Advani, Alpha A. Lee

TL;DR
This paper explores why stochastic gradient descent (SGD) is effective in machine learning by linking it to energy-entropy competition in physics, showing that undersampling biases SGD towards wide minima which generalize better.
Contribution
It establishes a novel connection between parameter inference in machine learning and free energy minimization in statistical physics, explaining SGD's empirical success.
Findings
Wide but shallow minima can be optimal when the system is undersampled.
Stochasticity biases SGD towards wide minima.
Analytical results are derived for linear neural networks.
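As a rough illustration of what a structured, non-isotropic SGD noise can look like, the sketch below computes the covariance of per-sample gradients for a two-parameter linear least-squares model. It is not the paper's analysis: the data, the feature scales, and the helper names `per_sample_gradients` and `gradient_noise_covariance` are all assumptions made for this example. The only point is that minibatch noise inherits the geometry of the data rather than being uniform in every direction.

```python
import numpy as np

def per_sample_gradients(w, X, y):
    """Gradient of 0.5 * (x_i . w - y_i)^2 for each sample i (one row per sample)."""
    residuals = X @ w - y
    return residuals[:, None] * X

def gradient_noise_covariance(w, X, y):
    """Covariance of per-sample gradients; minibatch noise scales with this matrix."""
    grads = per_sample_gradients(w, X, y)
    return np.cov(grads, rowvar=False)

# Hypothetical toy data: two input features with different scales.
rng = np.random.default_rng(0)
n_samples = 5000
feature_scales = np.array([1.0, 3.0])
X = rng.normal(size=(n_samples, 2)) * feature_scales
w_true = np.array([1.0, 1.0])
y = X @ w_true + 0.1 * rng.normal(size=n_samples)

# Evaluate the noise away from the solution, where gradients are non-zero.
w = np.zeros(2)
noise_cov = gradient_noise_covariance(w, X, y)

# Ratio of largest to smallest eigenvalue: 1 would mean isotropic noise.
eigenvalues = np.linalg.eigvalsh(noise_cov)
anisotropy = eigenvalues[-1] / eigenvalues[0]
print("noise covariance:\n", noise_cov)
print("anisotropy ratio:", anisotropy)
```

An anisotropy ratio well above 1 means the noise is much stronger along some parameter directions than others, so it cannot be modelled as a single scalar temperature times the identity.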
Abstract
Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent algorithm is widely used and delivers state of the art results for many problems. Nonetheless, Stochastic Gradient Descent typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inference and free energy minimisation in statistical physics. The degree of undersampling plays the role of temperature. Analogous to the energy-entropy competition in statistical physics, wide but shallow minima can be optimal if the system is undersampled, as is typical in many applications. Moreover, we show that the stochasticity in the algorithm has a non-trivial correlation structure which systematically biases it towards wide minima. We illustrate our argument with two prototypical models:…
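The energy-entropy competition described in the abstract can be sketched with a textbook toy model: two one-dimensional quadratic basins, one deep and narrow, one shallow and wide. This is a minimal illustration under stated assumptions, not the paper's derivation. The Gaussian free-energy formula is the standard result for a harmonic well, the depths and curvatures are made-up numbers, and `temperature` stands in for the degree of undersampling without any claim about the exact mapping used by the authors.

```python
import math

def basin_free_energy(depth, curvature, temperature):
    """Free energy F = -T log Z of the basin L(w) = depth + 0.5 * curvature * w**2.

    Z = exp(-depth / T) * sqrt(2 * pi * T / curvature), so
    F = depth - (T / 2) * log(2 * pi * T / curvature).
    """
    return depth - 0.5 * temperature * math.log(2.0 * math.pi * temperature / curvature)

def crossover_temperature(deep, wide):
    """Temperature at which the two basins have equal free energy."""
    (e_deep, k_deep), (e_wide, k_wide) = deep, wide
    return 2.0 * (e_wide - e_deep) / math.log(k_deep / k_wide)

# Hypothetical basins as (depth, curvature): low curvature means a wide minimum.
narrow_deep = (0.0, 100.0)
wide_shallow = (0.5, 1.0)

def preferred(temperature):
    """Name of the basin with the lower free energy at this temperature."""
    f_narrow = basin_free_energy(*narrow_deep, temperature)
    f_wide = basin_free_energy(*wide_shallow, temperature)
    return "narrow_deep" if f_narrow < f_wide else "wide_shallow"

t_star = crossover_temperature(narrow_deep, wide_shallow)
for temperature in (0.05, 1.0):
    print(temperature, preferred(temperature))
print("crossover temperature:", t_star)
```

At low temperature the energy term dominates and the deep, narrow basin has the lower free energy. Above the crossover, the entropy term (the logarithm of the basin width) dominates and the wide, shallow basin is preferred.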
