On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature
Yikuan Zhang, Ning Yang, Yuhai Tu

TL;DR
This paper reveals that the relationship between SGD noise covariance and loss landscape curvature is more complex than previously assumed, showing an approximate power-law relation rather than direct proportionality, validated across various deep learning models.
Contribution
The paper derives a general relationship between the SGD noise covariance and the per-sample Hessians, challenging the common assumption that the noise covariance is directly proportional to the Hessian in deep neural networks.
Findings
SGD noise covariance approximately commutes with the Hessian.
The diagonal elements of the noise covariance and of the Hessian follow an approximate power-law relation with an exponent between 1 and 2.
Experimental validation across datasets and architectures supports the theoretical bounds.
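The two quantities named in these findings can be measured on a toy problem. The sketch below is not the paper's experimental setup: it assumes a linear least-squares model (so per-sample gradients and Hessians have closed forms), measures the diagonals in the plain parameter basis, and uses hypothetical helper names (`noise_cov_and_hessian`, `commutator_ratio`, `diagonal_exponent`). It shows only how a commutator check and a log-log fit of the diagonals could be set up.

```python
import numpy as np

def noise_cov_and_hessian(X, y, w):
    """Gradient covariance C and Hessian H for the per-sample loss 0.5 * (x.w - y)^2."""
    r = X @ w - y                        # residuals, shape (N,)
    G = r[:, None] * X                   # per-sample gradients r_p * x_p, shape (N, d)
    Gc = G - G.mean(axis=0)              # centre around the full-batch gradient
    C = Gc.T @ Gc / len(y)               # per-sample gradient covariance
    # Minibatch SGD noise is this covariance up to a batch-size factor,
    # which affects neither the commutator ratio nor the fitted exponent.
    H = X.T @ X / len(y)                 # mean of per-sample Hessians x_p x_p^T
    return C, H

def commutator_ratio(C, H):
    """Frobenius norm of CH - HC, normalised so the result lies in [0, 1]."""
    CH, HC = C @ H, H @ C
    return np.linalg.norm(CH - HC) / (np.linalg.norm(CH) + np.linalg.norm(HC))

def diagonal_exponent(C, H):
    """Slope of log C_ii against log H_ii (least-squares line fit)."""
    slope, _ = np.polyfit(np.log(np.diag(H)), np.log(np.diag(C)), 1)
    return slope

# Toy data: features with widely different scales so the diagonals span a range.
rng = np.random.default_rng(0)
N, d = 2000, 8
X = rng.normal(size=(N, d)) * np.logspace(-1, 1, d)
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)
w = w_true + 0.1 * rng.normal(size=d)    # evaluate slightly away from the optimum

C, H = noise_cov_and_hessian(X, y, w)
print("commutator ratio:", commutator_ratio(C, H))
print("fitted exponent:", diagonal_exponent(C, H))
```

A ratio near 0 would indicate that the two matrices nearly commute. The fitted slope plays the role of the power-law exponent, but the value obtained on this toy model says nothing about the values reported for deep networks.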
Abstract
Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance is proportional to the Hessian. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity-Weight Duality, we find a more general relationship, agnostic to the specific loss formulation, that expresses the SGD noise covariance in terms of the per-sample Hessians, whose average is the Hessian. As a consequence, the SGD noise covariance and the Hessian commute…
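For reference, the objects named in the abstract can be written in a standard form. The symbols below ($L$, $\ell_p$, $g_p$, $h_p$, $C$, $H$, $N$, $\gamma$) are notational assumptions chosen for this sketch and may differ from the paper's own. The batch-size prefactor on the noise covariance is omitted, and the last line only restates the findings listed above rather than deriving them.

```latex
\[
\begin{aligned}
L(w) &= \frac{1}{N}\sum_{p=1}^{N} \ell_p(w), \qquad
g_p = \nabla_w \ell_p(w), \qquad
h_p = \nabla_w^2 \ell_p(w), \\
H &= \nabla_w^2 L(w) = \frac{1}{N}\sum_{p=1}^{N} h_p, \qquad
C = \frac{1}{N}\sum_{p=1}^{N} (g_p - \bar g)(g_p - \bar g)^{\top}, \quad
\bar g = \frac{1}{N}\sum_{p=1}^{N} g_p, \\
CH - HC &\approx 0, \qquad
C_{ii} \propto H_{ii}^{\gamma}, \quad 1 \le \gamma \le 2 .
\end{aligned}
\]
```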
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
