Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise

Yuchen Fang; James Demmel; Javad Lavaei

arXiv:2602.13413 · cs.LG · February 17, 2026

TL;DR

This paper establishes a worst-case complexity theory for stochastically preconditioned SGD under heavy-tailed noise, showing normalization's advantages over clipping in convergence guarantees and providing insights into empirical training practices.

Contribution

The paper introduces a worst-case complexity framework for stochastically preconditioned SGD (SPSGD) under heavy-tailed noise and shows that normalization has stronger worst-case convergence guarantees than clipping.

Findings

01. Normalization guarantees convergence to a first-order stationary point at rate $O(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known and $O(T^{-\frac{p-1}{2p}})$ when they are unknown.

02. Clipping may fail to converge in the worst case due to dependence issues.

03. The analysis develops a novel vector-valued Burkholder-type inequality.
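
To make the contrast between the two stabilizers concrete, here is a minimal sketch of one normalized step and one clipped step under a diagonal preconditioner. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the diagonal form of the preconditioner, the threshold `tau`, and the choice to clip before preconditioning are all hypothetical.

```python
import math


def precondition(grad, precond_diag):
    """Apply a diagonal preconditioner elementwise (hypothetical stand-in
    for the stochastic preconditioner of an Adam- or RMSProp-like method)."""
    return [g / d for g, d in zip(grad, precond_diag)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalized_step(x, grad, precond_diag, lr):
    """Normalization: rescale the preconditioned direction to unit length,
    so every step has length exactly lr regardless of gradient size."""
    d = precondition(grad, precond_diag)
    n = norm(d)
    if n == 0.0:
        return list(x)  # zero direction: stay put
    return [xi - lr * di / n for xi, di in zip(x, d)]


def clipped_step(x, grad, precond_diag, lr, tau):
    """Clipping: shrink the gradient only when its norm exceeds tau,
    then precondition (this ordering is an assumption of the sketch)."""
    n = norm(grad)
    scale = min(1.0, tau / n) if n > 0.0 else 1.0
    d = precondition([scale * g for g in grad], precond_diag)
    return [xi - lr * di for xi, di in zip(x, d)]


# A moderate gradient and a heavy-tailed outlier pointing the same way.
x0 = [0.0, 0.0]
ones = [1.0, 1.0]
step_small = normalized_step(x0, [3.0, 4.0], ones, lr=0.1)
step_large = normalized_step(x0, [3000.0, 4000.0], ones, lr=0.1)
print(norm(step_small), norm(step_large))
```

In this sketch the normalized step has the same length for the moderate gradient and for the outlier, while the clipped step changes behavior depending on whether the gradient norm crosses `tau`.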

Abstract

We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1, 2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $O(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $O(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for…
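
As a quick check on how the two quoted rates compare, the following sketch evaluates the exponents $\frac{p-1}{3p-2}$ and $\frac{p-1}{2p}$ exactly for given values of $p$. The helper name `rate_exponents` is hypothetical; only the two formulas come from the abstract.

```python
from fractions import Fraction


def rate_exponents(p):
    """Return the exponents of T in the two rates quoted in the abstract.

    known:   (p - 1) / (3p - 2)  for the rate O(T^{-known})
    unknown: (p - 1) / (2p)      for the rate O(T^{-unknown})
    """
    p = Fraction(p)
    if not (1 < p <= 2):
        raise ValueError("the abstract assumes p in (1, 2]")
    known = (p - 1) / (3 * p - 2)
    unknown = (p - 1) / (2 * p)
    return known, unknown


for p in (Fraction(3, 2), Fraction(2)):
    known, unknown = rate_exponents(p)
    print(f"p = {p}: known-parameter exponent {known}, "
          f"unknown-parameter exponent {unknown}")
```

At $p = 3/2$ the exponents are $1/5$ and $1/6$, so knowing the problem parameters gives the faster rate. At $p = 2$ (finite variance) both formulas give $1/4$.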

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Statistical Methods and Inference