TL;DR
This paper offers a thermodynamic interpretation of stochastic gradient descent (SGD) in neural network training, framing it as free energy minimization in which the learning rate sets the temperature and the outcome depends on whether the model is underparameterized or overparameterized.
Contribution
It introduces a novel thermodynamic perspective on SGD, linking learning rate to temperature and explaining convergence behavior in underparameterized and overparameterized models.
Findings
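Underparameterized (UP) models consistently follow free energy minimization, with temperature increasing monotonically with the learning rate (LR).
In overparameterized (OP) models, the temperature effectively drops to zero at low LRs, so SGD minimizes the loss directly and converges to an optimum.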
The difference is due to the signal-to-noise ratio of stochastic gradients near optima.
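The findings above can be illustrated with a toy simulation. This is a minimal sketch and not the paper's experimental setup: it runs 1-D SGD on the loss w²/2 and assumes two hypothetical noise models, additive gradient noise that persists at the optimum (a stand-in for the UP regime) and multiplicative noise that vanishes at the optimum (a stand-in for the OP regime). The function name `stationary_loss` and its parameters are illustrative only.

```python
import random


def stationary_loss(lr, noise_at_optimum, steps=20000, seed=0):
    """Run 1-D SGD on L(w) = w**2 / 2 with a fixed learning rate.

    Returns the mean loss over the second half of training, used here
    as a rough estimate of the stationary loss level.
    Toy assumption: `noise_at_optimum` switches between additive noise
    (UP-like) and multiplicative noise (OP-like).
    """
    rng = random.Random(seed)
    w = 1.0
    total, count = 0.0, 0
    for t in range(steps):
        eps = rng.gauss(0.0, 1.0)
        if noise_at_optimum:
            # Additive noise: the stochastic gradient stays noisy at w = 0.
            grad = w + eps
        else:
            # Multiplicative noise: the stochastic gradient is exactly 0 at w = 0.
            grad = w * (1.0 + eps)
        w -= lr * grad
        if t >= steps // 2:
            total += 0.5 * w * w
            count += 1
    return total / count


for lr in (0.01, 0.1, 0.5):
    up_like = stationary_loss(lr, noise_at_optimum=True)
    op_like = stationary_loss(lr, noise_at_optimum=False)
    print(f"lr={lr}: UP-like loss={up_like:.3e}, OP-like loss={op_like:.3e}")
```

In the additive case the iterates keep fluctuating around the optimum, and the plateau loss grows with the learning rate (for this toy the stationary value is lr / (2 (2 - lr))). In the multiplicative case the noise shrinks together with the gradient, so at these learning rates the loss decays to zero, which loosely mirrors the zero-temperature behavior described for OP models.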
Abstract
We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function, balancing training loss and the entropy of the weights distribution, with temperature determined by the LR. This perspective offers a new lens on why high LRs prevent training from converging to the loss minima and how different LRs lead to stabilization at different loss levels. We empirically validate the free energy framework on both underparameterized (UP) and overparameterized (OP) models. UP models consistently follow free energy minimization, with temperature increasing monotonically with LR, while for OP models, the temperature effectively drops to zero at low LRs, causing SGD to minimize the loss directly and converge to an optimum.…
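The balance described in the abstract can be written out as a short sketch, assuming the standard Helmholtz form of free energy. The symbols here are assumptions chosen for illustration: U for the expected training loss, S for the entropy of the weight distribution p(w), T for the LR-dependent temperature, and L(w) for the training loss.

```latex
F = U - T S,
\qquad
U = \mathbb{E}_{w \sim p}\,[L(w)],
\qquad
S = -\int p(w) \log p(w)\, dw
```

Under this form, a larger T puts more weight on the entropy term, so the stationary distribution spreads out and settles at a higher loss level. As T approaches 0, minimizing F reduces to minimizing U alone, which matches the description of OP models at low LRs.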
Taxonomy
Methods: Stochastic Gradient Descent
