Conservation Law Breaking at the Edge of Stability: A Spectral Theory of Non-Convex Neural Network Optimization
Daniel Nobrega Medeiros

TL;DR
This paper develops a spectral theory explaining why gradient descent reliably finds good solutions in non-convex neural network training by analyzing conservation laws, spectral properties, and dynamical regimes.
Contribution
It introduces a spectral framework and conservation laws that elucidate the dynamics of gradient descent in non-convex neural networks, validated through extensive experiments.
Findings
Gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws, which confine trajectories to lower-dimensional manifolds.
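Under discrete gradient descent the conservation laws break: total drift scales as eta^alpha, with alpha approximately 1.1-1.6 depending on architecture, loss function, and width.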
Exponential spectral compression explains self-regularization in cross-entropy loss.
Abstract
Why does gradient descent reliably find good solutions in non-convex neural network optimization, despite the landscape being NP-hard in the worst case? We show that gradient flow on L-layer ReLU networks without bias preserves L-1 conservation laws C_l = ||W_{l+1}||_F^2 - ||W_l||_F^2, confining trajectories to lower-dimensional manifolds. Under discrete gradient descent, these laws break with total drift scaling as eta^alpha where alpha is approximately 1.1-1.6 depending on architecture, loss function, and width. We decompose this drift exactly as eta^2 * S(eta), where the gradient imbalance sum S(eta) admits a closed-form spectral crossover formula with mode coefficients c_k proportional to e_k(0)^2 * lambda_{x,k}^2, derived from first principles and validated for both linear (R=0.85) and ReLU (R>0.80) networks. For cross-entropy loss, softmax probability concentration drives…
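The drift decomposition stated in the abstract can be checked numerically in the smallest case, L = 2, where there is a single conserved quantity C_1 = ||W_2||_F^2 - ||W_1||_F^2. The sketch below is an illustration under stated assumptions, not the paper's experimental code: it assumes a bias-free two-layer ReLU network, a mean-squared-error loss, random Gaussian data, and it takes the gradient imbalance sum S(eta) to be the accumulated difference of squared gradient norms between the two layers, which is what a one-step expansion of C_1 under gradient descent gives. The function name run_gd and all sizes are hypothetical.

```python
import numpy as np

def run_gd(eta, steps=200, seed=0):
    """Run plain gradient descent on a bias-free 2-layer ReLU network.

    Returns (drift, S) where drift = C_1(final) - C_1(initial) and
    S is the assumed gradient imbalance sum over all steps.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(32, 5))         # 32 samples, 5 features (illustrative)
    Y = rng.normal(size=(32, 1))         # random regression targets
    W1 = rng.normal(size=(5, 8)) * 0.5   # first layer, no bias
    W2 = rng.normal(size=(8, 1)) * 0.5   # second layer, no bias
    n = X.shape[0]

    C0 = np.sum(W2**2) - np.sum(W1**2)   # C_1 at initialization
    S = 0.0
    for _ in range(steps):
        Z = X @ W1                       # pre-activations
        H = np.maximum(Z, 0.0)           # ReLU
        R = H @ W2 - Y                   # residuals of the MSE loss
        G2 = H.T @ R / n                 # dLoss/dW2
        G1 = X.T @ ((R @ W2.T) * (Z > 0)) / n  # dLoss/dW1
        # Assumed per-step contribution to the gradient imbalance sum.
        S += np.sum(G2**2) - np.sum(G1**2)
        W1 -= eta * G1
        W2 -= eta * G2

    drift = (np.sum(W2**2) - np.sum(W1**2)) - C0
    return drift, S

if __name__ == "__main__":
    for eta in (0.005, 0.01, 0.02):
        drift, S = run_gd(eta)
        print(f"eta={eta}: drift={drift:.3e}, eta^2*S={eta**2 * S:.3e}")
```

The two printed columns should agree up to floating-point error. The reason is that ReLU is positively homogeneous, so the inner products <W_1, dLoss/dW_1> and <W_2, dLoss/dW_2> are equal and the first-order term in eta cancels at every step, leaving only the eta^2 term. The effective exponent alpha reported in the abstract would then come from how S itself varies with eta, which this sketch does not attempt to model.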
