On the Stability of Nonlinear Dynamics in GD and SGD: Beyond Quadratic Potentials

Rotem Mulayoff; Sebastian U. Stich

arXiv:2602.14789 · cs.LG · February 17, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the nonlinear stability of gradient descent and stochastic gradient descent, revealing that nonlinear effects can cause divergence even when linear analysis suggests stability, and providing exact criteria for stability.

Contribution

It derives an exact nonlinear stability criterion for GD and extends the analysis to SGD, highlighting the limitations of linear stability analysis in nonlinear settings.

Findings

1. Nonlinear terms can cause divergence in SGD even if linear analysis indicates stability.

2. GD can stably oscillate near linearly unstable minima, challenging linear stability assumptions.

3. If all batches are linearly stable, SGD's nonlinear dynamics are stable in expectation.
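
The contrast between linear and nonlinear stability in GD can be illustrated with a small one-dimensional sketch. The two toy losses below are our own assumptions chosen for illustration; they are not taken from the paper, and the sketch does not implement the paper's criterion. Both losses have a minimum at 0 with curvature 1, so linear analysis gives the same threshold (step size below 2) for both. With a step size of 2.1, the loss with a negative quartic term settles into a bounded period-2 oscillation, while the loss with a positive quartic term diverges.

```python
import math


def gd(grad, x0, eta, steps, blowup=1e6):
    """Run 1D gradient descent; stop early once |x| exceeds `blowup`."""
    x = x0
    traj = [x]
    for _ in range(steps):
        x = x - eta * grad(x)
        traj.append(x)
        if abs(x) > blowup:
            break
    return traj


def soft_grad(x):
    # Gradient of the assumed toy loss f(x) = x^2/2 - x^4/4 (local minimum at 0).
    return x - x**3


def hard_grad(x):
    # Gradient of the assumed toy loss f(x) = x^2/2 + x^4/4.
    return x + x**3


eta = 2.1  # above the linear stability threshold 2 / f''(0) = 2

soft = gd(soft_grad, 0.01, eta, 2000)
hard = gd(hard_grad, 0.01, eta, 2000)

# For the first loss, a period-2 cycle x -> -x -> x solves
# 2 = eta * (1 - x^2), i.e. |x| = sqrt(1 - 2 / eta).
amplitude = math.sqrt(1 - 2 / eta)

print("softening loss, last two iterates:", soft[-2], soft[-1])
print("predicted oscillation amplitude:", amplitude)
print("hardening loss stopped after", len(hard) - 1, "steps, |x| =", abs(hard[-1]))
```

In this toy setting the sign of the fourth-order term decides whether the linearly unstable minimum is surrounded by a bounded oscillation or by divergence, which is the kind of effect a purely linear analysis cannot distinguish.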

Abstract

The dynamical stability of the iterates during training plays a key role in determining the minima obtained by optimization algorithms. For example, stable solutions of gradient descent (GD) correspond to flat minima, which have been associated with favorable features. While prior work often relies on linearization to determine stability, it remains unclear whether linearized dynamics faithfully capture the full nonlinear behavior. Recent work has shown that GD may stably oscillate near a linearly unstable minimum and still converge once the step size decays, indicating that linear analysis can be misleading. In this work, we explicitly study the effect of nonlinear terms. Specifically, we derive an exact criterion for stable oscillations of GD near minima in the multivariate setting. Our condition depends on high-order derivatives, generalizing existing results. Extending the analysis…
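
The link between linear stability and flatness mentioned in the abstract can be sketched on a quadratic loss, where the linearized dynamics are exact. The example below is a minimal illustration under our own assumptions (a two-dimensional loss with diagonal Hessian and hand-picked step sizes); it only shows the standard threshold that GD with step size `eta` is linearly stable when `eta` is below 2 divided by the largest Hessian eigenvalue.

```python
def gd_quadratic(hess_diag, x0, eta, steps):
    """GD on f(x) = 0.5 * sum_i h_i * x_i^2, i.e. a diagonal Hessian."""
    x = list(x0)
    for _ in range(steps):
        # The gradient of f in coordinate i is h_i * x_i.
        x = [xi - eta * hi * xi for xi, hi in zip(x, hess_diag)]
    return x


# Assumed toy Hessian: for a diagonal matrix the eigenvalues are the entries.
hess_diag = [1.0, 4.0]
sharpness = max(hess_diag)      # largest eigenvalue
threshold = 2.0 / sharpness     # linear stability requires eta < threshold

stable = gd_quadratic(hess_diag, [1.0, 1.0], 0.45, 200)    # eta below threshold
unstable = gd_quadratic(hess_diag, [1.0, 1.0], 0.55, 200)  # eta above threshold

print("threshold:", threshold)
print("eta = 0.45, max |x_i|:", max(abs(v) for v in stable))
print("eta = 0.55, max |x_i|:", max(abs(v) for v in unstable))
```

For a fixed step size, this threshold means only minima whose largest curvature is below `2 / eta` are linearly stable, which is why stability is often read as a preference for flat minima.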

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The proofs are technical and seem to be correct (as far as I checked). The write-up is well structured and makes the intuition behind the proofs clear. Good examples illustrate the theorems.

Weaknesses

I have some reservations concerning the contribution of the results of the paper. Since there are no empirical experiments, the theoretical contribution is the only contribution of the paper.
- Concerning GD, the fact that we can have stable (period-2) oscillations when we go beyond the stability threshold seems to be well known in the literature. In particular, Damian et al. “Self-Stabilization…” (2022) uses exactly that mechanism to show the self-stabilization of GD; in particular, that’

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Understanding the training dynamics of gradient descent and stochastic gradient descent is an interesting research question.
- The proposed theory explains period-2 cycle dynamics of GD beyond standard linear stability, which appears to be new.

Weaknesses

- The results focus on isolated minima, which differ from the many connected minima typically found in deep learning, as the authors note.
- The SGD analysis assumes each batch has its own minimum and that batches are independent, which may be a somewhat strong assumption.

Reviewer 03 · Rating 0 · Confidence 5

Strengths

The paper has good examples and gives ideas cleanly.

Weaknesses

Previous Work on Stability of SGD

There are a few papers, I believe by Wu et al. (2023) and Andreyev and Beneventano (2025), which should be discussed as direct competitors. The former discusses interpolating minima; the latter is empirical and discusses the fact that several notions of stability exist, picking the one that seems to explain the trajectory in neural networks. I understand this paper is not about neural networks but the significance of it is for neural

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Reservoir Computing · Markov Chains and Monte Carlo Methods