Why is Normalization Preferred? A Worst-Case Complexity Theory for Stochastically Preconditioned SGD under Heavy-Tailed Noise
Yuchen Fang, James Demmel, Javad Lavaei

TL;DR
This paper establishes a worst-case complexity theory for stochastically preconditioned SGD under heavy-tailed noise. It shows that normalization carries convergence guarantees that clipping lacks in this setting, which offers insight into empirical training practices.
Contribution
The paper introduces a worst-case complexity framework for stochastically preconditioned SGD (SPSGD) under heavy-tailed noise and uses it to show that normalization has stronger worst-case convergence properties than clipping.
Findings
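Normalization guarantees convergence to a first-order stationary point, with explicit rates both when problem parameters are known and when they are unknown.
Clipping may fail to converge in the worst case because of statistical dependence between the stochastic preconditioner and the gradient estimates.
The analysis relies on a novel vector-valued Burkholder-type inequality.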
Abstract
We develop a worst-case complexity theory for stochastically preconditioned stochastic gradient descent (SPSGD) and its accelerated variants under heavy-tailed noise, a setting that encompasses widely used adaptive methods such as Adam, RMSProp, and Shampoo. We assume the stochastic gradient noise has a finite $p$-th moment for some $p \in (1, 2]$, and measure convergence after $T$ iterations. While clipping and normalization are parallel tools for stabilizing training of SGD under heavy-tailed noise, there is a fundamental separation in their worst-case properties in stochastically preconditioned settings. We demonstrate that normalization guarantees convergence to a first-order stationary point at rate $\mathcal{O}(T^{-\frac{p-1}{3p-2}})$ when problem parameters are known, and $\mathcal{O}(T^{-\frac{p-1}{2p}})$ when problem parameters are unknown, matching the optimal rates for…
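To make the clipping versus normalization distinction concrete, the following is a minimal sketch and not the paper's algorithm. It assumes an RMSProp-style diagonal preconditioner built from the same stochastic gradient that drives the step, and it assumes clipping or normalization is applied to the preconditioned direction. The function names, the placement of the operation, and the defaults `beta`, `eps`, and `tau` are all hypothetical choices made for illustration.

```python
import numpy as np


def clip(d, tau):
    """Rescale d so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(d)
    return d if norm <= tau else d * (tau / norm)


def normalize(d, eps=1e-12):
    """Rescale d to (approximately) unit Euclidean norm."""
    return d / (np.linalg.norm(d) + eps)


def preconditioned_step(x, g, v, lr, mode="normalize",
                        tau=1.0, beta=0.9, eps=1e-8):
    """One illustrative SPSGD-style update (assumed form, see lead-in).

    x : current iterate
    g : stochastic gradient at x (possibly heavy-tailed)
    v : running second-moment estimate (the stochastic preconditioner state)
    """
    # The preconditioner is updated with the same noisy gradient g,
    # so the preconditioner and the gradient estimate are dependent.
    v = beta * v + (1.0 - beta) * g ** 2
    d = g / (np.sqrt(v) + eps)

    if mode == "normalize":
        d = normalize(d)      # step length is always lr
    elif mode == "clip":
        d = clip(d, tau)      # step length is at most lr * tau
    else:
        raise ValueError(f"unknown mode: {mode}")

    return x - lr * d, v
```

With `mode="normalize"` every step has length `lr` no matter how large the noise sample is, whereas with `mode="clip"` the step length depends on whether the preconditioned direction exceeds the threshold `tau`.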
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Statistical Methods and Inference
