Better scalability under potentially heavy-tailed gradients
Matthew J. Holland

TL;DR
This paper introduces a scalable alternative to robust gradient descent for settings where gradients may be heavy-tailed. It avoids costly per-step gradient aggregation while keeping formal robustness guarantees, and it scales better to large learning problems.
Contribution
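Proposes a simple, scalable alternative to per-step robust gradient aggregation: run cheap stochastic sub-processes for a single pass over partitioned data, then choose a candidate that does not diverge too far from the majority. The method maintains formal guarantees and suits large-scale problems with potentially heavy-tailed data.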
Findings
Method is simple to implement and parallelize.
Maintains formal robustness guarantees.
Performs well under both sub-Gaussian and heavy-tailed data.
Abstract
We study a scalable alternative to robust gradient descent (RGD) techniques that can be used when the gradients can be heavy-tailed, though this will be unknown to the learner. The core technique is simple: instead of trying to robustly aggregate gradients at each step, which is costly and leads to sub-optimal dimension dependence in risk bounds, we choose a candidate which does not diverge too far from the majority of cheap stochastic sub-processes run for a single pass over partitioned data. In addition to formal guarantees, we also provide empirical analysis of robustness to perturbations to experimental conditions, under both sub-Gaussian and heavy-tailed data. The result is a procedure that is simple to implement, trivial to parallelize, which keeps the formal strength of RGD methods but scales much better to large learning problems.
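The divide-and-select idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the abstract does not specify the selection rule, so the sketch assumes one plausible choice (pick the candidate whose smallest ball covering a majority of all candidates has the least radius). The function names `sgd_single_pass`, `select_candidate`, and `divide_and_select`, the least-squares loss, and the fixed step size are all hypothetical choices made for illustration.

```python
import numpy as np


def sgd_single_pass(X, y, step=0.01):
    """One pass of plain SGD on squared error (assumed loss, for illustration)."""
    w = np.zeros(X.shape[1])
    for x_i, y_i in zip(X, y):
        grad = (x_i @ w - y_i) * x_i  # gradient of 0.5 * (x_i @ w - y_i)**2
        w -= step * grad
    return w


def select_candidate(candidates):
    """Return the candidate closest to the majority of the others.

    Assumed rule: for each candidate, find the radius of the smallest ball
    centred on it that contains more than half of all candidates, then
    return the candidate with the smallest such radius.
    """
    C = np.asarray(candidates, dtype=float)
    k = len(C)
    # Pairwise Euclidean distances, shape (k, k); the diagonal is zero.
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    # Sorted row entry at index k // 2 covers k // 2 + 1 > k / 2 candidates.
    radii = np.sort(dists, axis=1)[:, k // 2]
    return C[np.argmin(radii)]


def divide_and_select(X, y, k=10, step=0.01, seed=0):
    """Partition the data, run one cheap SGD pass per part, pick one candidate."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), k)
    # Each sub-process touches only its own partition, so this loop could be
    # run in parallel without any communication between workers.
    candidates = [sgd_single_pass(X[p], y[p], step) for p in parts]
    return select_candidate(candidates)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_true = np.array([1.0, -2.0, 0.5])
    X = rng.normal(size=(5000, 3))
    # Student-t noise with few degrees of freedom as a heavy-tailed example.
    y = X @ w_true + rng.standard_t(df=2.1, size=5000)
    w_hat = divide_and_select(X, y, k=10)
    print("estimation error:", np.linalg.norm(w_hat - w_true))
```

Because robustness enters only once, at the final selection step, each sub-process stays as cheap as ordinary SGD; this is the design choice the abstract contrasts with robustly aggregating gradients at every iteration.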
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Machine Learning and Algorithms
