A dynamic view of some anomalous phenomena in SGD

Vivek Shripad Borkar

arXiv:2505.01751·math.OC·September 16, 2025


TL;DR

This paper offers a new explanation for anomalous phenomena in stochastic gradient descent, such as double descent and grokking, using two time scale stochastic approximation theory applied to gradient dynamics.

Contribution

It introduces a novel perspective on these phenomena by applying two time scale stochastic approximation to continuous-time gradient dynamics.

Findings

1. Provides a plausible explanation for double descent and grokking phenomena.

2. Connects anomalous training behaviors to stochastic approximation theory.

3. Offers insights into the dynamics of over-parametrized neural networks.

Abstract

It has been observed by Belkin et al. that over-parametrized neural networks exhibit a 'double descent' phenomenon. That is, as the model complexity (as reflected in the number of features) increases, the test error initially decreases, then increases, and then decreases again. A counterpart of this phenomenon in the time domain has been noted in the context of epoch-wise training, viz., the test error decreases with the number of iterates, then increases, then decreases again. Another anomalous phenomenon is that of grokking wherein two regimes of descent are interrupted by a third regime wherein the mean loss remains almost constant. This note presents a plausible explanation for these and related phenomena by using the theory of two time scale stochastic approximation, applied to the continuous time limit of the gradient dynamics. This gives a novel perspective for an…
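
For readers unfamiliar with the term, the following is a minimal sketch of a generic two time scale stochastic approximation scheme, not the construction used in the paper (the abstract does not state its equations). The drift functions, step-size exponents, noise level, and the name `two_time_scale_sa` are all illustrative assumptions. The fast iterate `x` uses the larger step size `a_n` and tracks the slow iterate `y`, which uses the smaller step size `b_n` with `b_n / a_n -> 0`.

```python
import random


def two_time_scale_sa(num_steps=20000, noise_std=0.1, seed=0):
    """Toy coupled iteration with a fast variable x and a slow variable y.

    Assumed (hypothetical) drifts: x is pulled toward y, y is pushed by -x.
    On the fast scale y looks frozen, so x tracks y; on the slow scale
    x is approximately y, so y behaves like the ODE dy/dt = -y.
    """
    rng = random.Random(seed)
    x, y = 0.0, 1.0
    for n in range(num_steps):
        a_n = 1.0 / (n + 1) ** 0.6  # fast step size (assumed exponent)
        b_n = 1.0 / (n + 1)         # slow step size, so b_n / a_n -> 0
        fast_drift = (y - x) + noise_std * rng.gauss(0.0, 1.0)
        slow_drift = -x + noise_std * rng.gauss(0.0, 1.0)
        # Update both iterates simultaneously from the previous values.
        x, y = x + a_n * fast_drift, y + b_n * slow_drift
    return x, y


final_x, final_y = two_time_scale_sa()
print(final_x, final_y)
```

In this toy example both iterates settle near zero, with `x` staying close to `y` throughout. The separation of step sizes is what lets the two variables be analyzed as if they evolved on different clocks, which is the kind of structure the abstract invokes for distinct training regimes.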
