High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes
Aukosh Jagannath, Taj Jones-McCormick, Varnan Sarangian

TL;DR
This paper develops a high-dimensional limit framework for SGD with momentum and adaptive step-sizes, revealing how these variants compare to standard SGD and their effects on convergence and stability.
Contribution
It introduces a rigorous high-dimensional scaling limit for SGD variants, analyzing their dynamics and benefits in complex learning problems.
Findings
SGD-M and online SGD have similar limits after rescaling.
Adaptive step-sizes improve convergence to population minima.
High-dimensional effects can degrade performance if step-sizes are not adjusted.
Abstract
We develop a high-dimensional scaling limit for Stochastic Gradient Descent with Polyak Momentum (SGD-M) and adaptive step-sizes. This provides a framework to rigorously compare online SGD with some of its popular variants. We show that the scaling limits of SGD-M coincide with those of online SGD after an appropriate time rescaling and a specific choice of step-size. However, if the step-size is kept the same between the two algorithms, SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. We demonstrate our framework on two popular learning problems: Spiked Tensor PCA and Single Index Models. In both cases, we also examine online SGD with an adaptive step-size based on normalized gradients. In the high-dimensional regime, this algorithm yields multiple benefits: its dynamics admit fixed points closer to the population minimum and…
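As a rough illustration of the three update rules the abstract compares, the following Python sketch writes out one step of online SGD, SGD with Polyak (heavy-ball) momentum, and SGD with a normalized gradient. The function names, the momentum parameter `beta`, and the constant-gradient check are illustrative assumptions, not the paper's notation or experiments. The check relies on the standard heavy-ball heuristic that, under a fixed gradient, the velocity settles at an effective step-size of `lr / (1 - beta)`. This summary does not state the paper's exact time rescaling or step-size choice, so the sketch should not be read as that result.

```python
import numpy as np


def sgd_step(x, grad, lr):
    """One online SGD step."""
    return x - lr * grad


def sgd_m_step(x, v, grad, lr, beta):
    """One heavy-ball (Polyak momentum) step; v is the velocity."""
    v_new = beta * v - lr * grad
    return x + v_new, v_new


def normalized_sgd_step(x, grad, lr, eps=1e-12):
    """One SGD step with the gradient rescaled to (nearly) unit norm."""
    return x - lr * grad / (np.linalg.norm(grad) + eps)


def terminal_velocity(grad, lr, beta, steps):
    """Velocity of SGD-M after `steps` updates with a fixed gradient."""
    x = np.zeros_like(grad)
    v = np.zeros_like(grad)
    for _ in range(steps):
        x, v = sgd_m_step(x, v, grad, lr, beta)
    return v


# Hypothetical values chosen only for the illustration.
g = np.array([1.0, -2.0, 0.5])
lr, beta = 0.01, 0.9

v_inf = terminal_velocity(g, lr, beta, steps=500)
effective_lr = lr / (1.0 - beta)

# With a fixed gradient, the momentum velocity approaches -effective_lr * g.
print(np.allclose(v_inf, -effective_lr * g))
```

With the same `lr`, each SGD-M step in this toy setting moves roughly `1 / (1 - beta)` times as far as an online SGD step, which is one informal way to see why keeping the step-size fixed between the two algorithms can change the behaviour.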
Peer Reviews
Decision·ICLR 2026 Poster
The paper is well written and well organized, with clear contributions. I enjoyed reading the motivation of the main ideas, and in particular, I find the writing of Section 2 on the main result of the paper very informative. Theorem 3.2 is the main theoretical contribution of the work, with the examples in Section 3 to justify the need for the theoretical result. The proofs for the major steps I checked seem correct, and the theoretical statements are reasonable.
I do not have a specific weakness to point out. I like this work and I believe it is worth being accepted to ICLR. Some questions: Typically, the high-dimensional limit theorems of an algorithm are purely theoretical results. In this work, the authors show that SGD-M will amplify high-dimensional effects, potentially degrading performance relative to online SGD. Can they comment on what else can be an interesting follow-up to these ideas? What other algorithms might have a similar outcome? I b
1. The paper provides a **valuable generalization** of high-dimensional diffusion limits to include both momentum and adaptive step-size variants of SGD. This extension broadens the understanding of optimization dynamics beyond standard online SGD, offering a unified framework for several widely used algorithms. 2. The main theorems and assumptions are rigorously stated and internally consistent.
1. The manuscript does not clearly highlight its key novelty: the extension from online SGD to momentum and adaptive-step methods. The relation to existing online SGD results is underdeveloped theoretically, which makes the contribution appear incremental despite being substantial. 2. The effective limits of SGD-U are difficult to follow. Neither a general limiting theorem nor a comparison to SGD/SGD-M is provided. 3. The paper’s structure could better emphasize the main theoretical message. Sect
I would like to note that I am not a specialist in stochastic differential equations or high-dimensional diffusion limits. While I have done my best to assess the work carefully, my understanding of some of the deeper technical aspects is limited. 1. The paper presents a non-trivial and technically rigorous generalization of existing high-dimensional diffusion limit results to momentum and adaptive-step-size algorithms. Extending the practical dynamics framework of Ben Arous et al. to handle ad
1. The paper is extremely technical and assumes strong familiarity with high-dimensional probability, weak convergence, and diffusion processes. While this is common in this line of work, some readers may find it challenging to connect the formal definitions. For instance, the concepts of delta-localizability and delta-closability are poorly described. 2. I am not entirely sure about this claim, but it seems the localizability and closability conditions are mathematically strong (requiring smooth
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Gaussian Processes and Bayesian Inference
