Multiplicative noise and heavy tails in stochastic optimization
Liam Hodgkinson, Michael W. Mahoney

TL;DR
This paper investigates how multiplicative noise in stochastic optimization leads to heavy-tailed parameter distributions, affecting convergence and exploration in neural network training.
Contribution
The paper provides a theoretical framework linking multiplicative noise to heavy tails and shows, with supporting empirical evidence, that this link holds across a range of models and optimizers.
Findings
Multiplicative noise causes heavy-tailed stationary distributions in parameters.
Heavy tails improve exploration and basin hopping in non-convex optimization.
Results hold for a wide range of models, optimizers, and real neural network training scenarios.
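As a hedged illustration of the first finding, consider single-sample SGD on a one-dimensional least-squares loss. The symbols below (step size \gamma, data pair (x_k, y_k), coefficients A_k and B_k, tail index \alpha) are notation assumed for this sketch and are not taken from the summary above. The update is a random linear recurrence whose coefficient A_k multiplies the current iterate, which is the multiplicative noise in question. A classical Kesten-type result on such recurrences gives a power-law tail under mild regularity conditions.

```latex
% Single-sample SGD on the loss (1/2)(x_k w - y_k)^2 with step size \gamma
\begin{align*}
  w_{k+1} &= w_k - \gamma\, x_k \,(x_k w_k - y_k) \\
          &= \underbrace{(1 - \gamma x_k^2)}_{A_k}\, w_k
             + \underbrace{\gamma\, x_k y_k}_{B_k}.
\end{align*}
% Kesten-type tail behaviour (stated informally, under regularity conditions):
% if E[log|A_k|] < 0 and E[|A_k|^\alpha] = 1 for some \alpha > 0, then the
% stationary iterate W_\infty satisfies
\[
  \mathbb{P}\bigl(|W_\infty| > t\bigr) \sim C\, t^{-\alpha}
  \quad \text{as } t \to \infty .
\]
```

In this sketch a larger step size or more variable data widens the spread of A_k, which lowers \alpha and makes the tail heavier.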
Abstract
Although stochastic optimization is central to modern machine learning, the precise mechanisms underlying its success, and in particular, the precise role of the stochasticity, still remain unclear. Modelling stochastic optimization algorithms as discrete random recurrence relations, we show that multiplicative noise, as it commonly arises due to variance in local rates of convergence, results in heavy-tailed stationary behaviour in the parameters. A detailed analysis is conducted for SGD applied to a simple linear regression problem, followed by theoretical results for a much larger class of models (including non-linear and non-convex) and optimizers (including momentum, Adam, and stochastic Newton), demonstrating that our qualitative results hold much more generally. In each case, we describe dependence on key factors, including step size, batch size, and data variability, all of…
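The linear regression case mentioned in the abstract can be explored with a small simulation. The following is a minimal sketch, not the authors' experiment: it assumes one-dimensional data with standard normal inputs, batch size 1, and Gaussian label noise, and the names simulate_sgd and tail_ratio are hypothetical helpers introduced only for illustration. It runs many independent SGD chains and compares a crude tail measure (an extreme percentile of the absolute error divided by its median) for a small and a large step size.

```python
import numpy as np


def simulate_sgd(step_size, n_chains=20000, n_steps=500,
                 w_true=1.0, noise_std=1.0, seed=0):
    """Run independent single-sample SGD chains on 1-D least squares.

    Assumed setup (illustrative only): x ~ N(0, 1), y = w_true * x + noise.
    Returns the final iterate of every chain.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_chains)
    for _ in range(n_steps):
        x = rng.standard_normal(n_chains)
        y = w_true * x + noise_std * rng.standard_normal(n_chains)
        # Gradient of 0.5 * (x * w - y)**2 with respect to w.
        grad = x * (x * w - y)
        # Equivalent to w <- (1 - step_size * x**2) * w + step_size * x * y,
        # so the random factor (1 - step_size * x**2) multiplies the iterate.
        w = w - step_size * grad
    return w


def tail_ratio(samples, center, upper=99.9):
    """Ratio of an extreme percentile of |samples - center| to its median."""
    dev = np.abs(samples - center)
    return np.percentile(dev, upper) / np.median(dev)


small_step = simulate_sgd(step_size=0.1)
large_step = simulate_sgd(step_size=1.0)

print("tail ratio, step size 0.1:", tail_ratio(small_step, 1.0))
print("tail ratio, step size 1.0:", tail_ratio(large_step, 1.0))
```

Under these assumptions the larger step size should give a markedly larger tail ratio, in line with the dependence on step size that the abstract describes. A percentile ratio is only a rough diagnostic; a proper tail-index estimate would need more care.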
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference · Advanced Bandit Algorithms Research
Methods: Stochastic Gradient Descent · Adam · Linear Regression
