On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature

Yikuan Zhang; Ning Yang; Yuhai Tu

arXiv:2602.05600 · cs.LG · February 6, 2026

TL;DR

The SGD noise covariance is not simply proportional to the loss-landscape Hessian, as is often assumed. Instead, the diagonal elements of the two follow an approximate power-law relation with exponent between 1 and 2, a result validated across datasets and deep learning architectures.

Contribution

The paper derives a general relationship between the SGD noise covariance and the per-sample Hessians, $C \propto E_{p} [h_{p}^{2}]$, and uses it to challenge the common assumption that the noise covariance is directly proportional to the Hessian in deep neural networks.

Findings

01. The SGD noise covariance approximately commutes with the Hessian.

02. The diagonal elements of the noise covariance and the Hessian follow an approximate power-law relation with exponent between 1 and 2.

03. Experiments across datasets and architectures support the theoretical bounds on the exponent.
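
The exponent range in the findings can be motivated by a short worked inequality. This is an illustrative sketch, not the paper's derivation. It assumes that the per-sample Hessians $h_p$ share an eigenbasis with $H$ and that their eigenvalues along direction $i$ satisfy $0 \le \lambda_{p,i} \le \lambda_i^{\max}$. Both assumptions, and the symbols $\lambda_{p,i}$ and $\lambda_i^{\max}$, are introduced here only for illustration.

```latex
\begin{gather}
H_{ii} = E_p[\lambda_{p,i}], \qquad C_{ii} \propto E_p[\lambda_{p,i}^{2}], \\
H_{ii}^{2} = \big(E_p[\lambda_{p,i}]\big)^{2}
  \;\le\; E_p[\lambda_{p,i}^{2}]
  \;\le\; \lambda_i^{\max}\, E_p[\lambda_{p,i}] = \lambda_i^{\max} H_{ii}.
\end{gather}
```

The lower bound is Jensen's inequality. The upper bound uses $\lambda^{2} \le \lambda_i^{\max} \lambda$ for $0 \le \lambda \le \lambda_i^{\max}$. Under these assumptions $C_{ii}$ is sandwiched between a quadratic and a linear function of $H_{ii}$, which is consistent with an effective exponent between 1 and 2.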

Abstract

Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $C$ is proportional to the Hessian $H$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity-Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $C \propto E_{p} [h_{p}^{2}]$, where $h_{p}$ denotes the per-sample Hessian with $H = E_{p} [h_{p}]$. As a consequence, $C$ and $H$ commute…
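
The relation $C \propto E_{p} [h_{p}^{2}]$ with $H = E_{p} [h_{p}]$ can be explored numerically on a toy problem. The NumPy sketch below is not the paper's experiment. It assumes that all per-sample Hessians share one eigenbasis, in which case $C$ and $H$ commute exactly rather than approximately. The function name `toy_hessian_and_noise`, the exponential distribution of per-sample eigenvalues, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def toy_hessian_and_noise(n_samples=2000, dim=6, seed=0):
    """Build H = E_p[h_p] and C = E_p[h_p^2] for toy per-sample Hessians.

    Assumption (illustrative only): every h_p is diagonal in the same
    orthonormal basis Q, with nonnegative random eigenvalues.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal eigenbasis shared by all per-sample Hessians.
    Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Per-sample eigenvalues; the typical scale differs by direction.
    scales = np.logspace(-2, 1, dim)
    lam = rng.exponential(scale=scales, size=(n_samples, dim))
    # Q * v scales column j of Q by v[j], i.e. Q @ diag(v).
    H = (Q * lam.mean(axis=0)) @ Q.T          # mean of h_p
    C = (Q * (lam ** 2).mean(axis=0)) @ Q.T   # mean of h_p squared
    return H, C, Q

H, C, Q = toy_hessian_and_noise()

# With a shared eigenbasis the commutator vanishes up to rounding error.
commutator_norm = np.linalg.norm(H @ C - C @ H)

# Diagonal elements in the shared eigenbasis.
h_diag = np.diag(Q.T @ H @ Q)
c_diag = np.diag(Q.T @ C @ Q)

print("||HC - CH|| =", commutator_norm)
print("C_ii >= H_ii^2 in the shared basis:", bool(np.all(c_diag >= h_diag ** 2)))
```

In this construction $C$ is a nonlinear function of the per-sample curvatures rather than a multiple of $H$, so the ratio `c_diag / h_diag` varies across directions instead of being constant.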

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis