A dynamic view of some anomalous phenomena in SGD
Vivek Shripad Borkar

TL;DR
This paper offers a new explanation for anomalous phenomena in stochastic gradient descent, such as double descent and grokking, using two time scale stochastic approximation theory applied to gradient dynamics.
Contribution
It introduces a novel perspective on these phenomena by applying two time scale stochastic approximation to continuous-time gradient dynamics.
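For readers unfamiliar with the framework, the textbook form of a two time scale stochastic approximation scheme is sketched below. This is the generic form from the stochastic approximation literature, not necessarily the exact formulation used in this paper; the symbols $x_n$, $y_n$, $h$, $g$, $a(n)$, $b(n)$ and the noise terms $M^{(1)}_{n+1}$, $M^{(2)}_{n+1}$ are assumptions introduced here for illustration.

```latex
% Generic two time scale iteration (illustrative notation, not taken from the paper):
% x_n is the fast iterate, y_n the slow iterate, M^{(i)}_{n+1} are martingale difference noise terms.
\begin{align*}
  x_{n+1} &= x_n + a(n)\,\bigl[h(x_n, y_n) + M^{(1)}_{n+1}\bigr], \\
  y_{n+1} &= y_n + b(n)\,\bigl[g(x_n, y_n) + M^{(2)}_{n+1}\bigr],
\end{align*}
% Standard step-size conditions; the last one separates the two time scales.
\[
  \sum_n a(n) = \sum_n b(n) = \infty, \qquad
  \sum_n \bigl(a(n)^2 + b(n)^2\bigr) < \infty, \qquad
  \frac{b(n)}{a(n)} \to 0 .
\]
% Continuous-time counterpart: a singularly perturbed ODE with small parameter \epsilon > 0.
\[
  \dot{x}(t) = \tfrac{1}{\epsilon}\, h\bigl(x(t), y(t)\bigr), \qquad
  \dot{y}(t) = g\bigl(x(t), y(t)\bigr).
\]
```

Under these conditions the fast variable $x$ sees $y$ as nearly frozen and equilibrates first, while the slow variable $y$ sees $x$ as already equilibrated. Distinct phases of this kind are the sort of behavior a two time scale analysis can separate.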
Findings
Provides a plausible explanation for double descent and grokking phenomena.
Connects anomalous training behaviors to stochastic approximation theory.
Offers insights into the dynamics of over-parametrized neural networks.
Abstract
It has been observed by Belkin et al. that over-parametrized neural networks exhibit a 'double descent' phenomenon. That is, as the model complexity (as reflected in the number of features) increases, the test error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is that of grokking wherein two regimes of descent are interrupted by a third regime wherein the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena by using the theory of two time scale stochastic approximation, applied to the continuous time limit of the gradient dynamics. This gives a novel perspective for an…
