Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
Jiayu Zhang, Tianyi Lin

TL;DR
This paper investigates the theoretical limits and algorithmic strategies for scale-invariant neural network optimization under heavy-tailed noise, emphasizing the role of norm geometry and higher-order smoothness.
Contribution
It provides dimension-dependent lower bounds and matching upper bounds for scale-invariant methods, introduces a transported Scion method leveraging higher-order smoothness, and demonstrates practical effectiveness across architectures.
Findings
Dimension dependence is unavoidable for certain scale-invariant methods with general norms.
A batched Scion method achieves optimal bounds under the spectral norm.
Transported Scion improves convergence when higher-order smoothness is exploited.
Abstract
A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods have become important because their normalized layerwise updates not only support hyperparameter transfer across model sizes but also exploit input-output matrix norm geometry. At the same time, stochastic gradient noise in deep learning is often far from sub-Gaussian and may exhibit heavy tails. These observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norms, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization…
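To make "normalized layerwise updates" concrete, here is a minimal sketch of one well-known way to build a scale-invariant step for a single weight matrix under the spectral norm: replace the gradient by its orthogonal polar factor, so the step depends on the gradient's direction but not its magnitude. This is an illustration under stated assumptions, not the paper's Scion, batched Scion, or transported Scion method. The function names `spectral_direction` and `normalized_step`, the learning rate, and the matrix shapes are all hypothetical.

```python
import numpy as np


def spectral_direction(grad):
    """Return U @ V^T from the reduced SVD of grad.

    All singular values of the result equal 1, so its spectral norm is 1
    regardless of how large or small grad is.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def normalized_step(weight, grad, lr):
    """One layerwise update that uses only the normalized direction (assumed form)."""
    return weight - lr * spectral_direction(grad)


# Hypothetical layer and gradient, for illustration only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
G = rng.standard_normal((4, 3))

d_small = spectral_direction(G)
d_large = spectral_direction(1000.0 * G)  # same gradient, rescaled

# The direction is unchanged by rescaling the gradient.
same_direction = np.allclose(d_small, d_large)
W_next = normalized_step(W, G, lr=0.1)
```

Because the step size is fixed by `lr` rather than by the gradient's magnitude, a single very large stochastic gradient cannot produce an arbitrarily large update, which is one intuition for why normalized steps are studied in heavy-tailed noise settings.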
