Efficient Distributed Optimization under Heavy-Tailed Noise
Su Hyeong Lee, Manzil Zaheer, Tian Li

TL;DR
This paper introduces TailOPT, a novel distributed optimization framework that effectively handles heavy-tailed stochastic gradient noise, improving training efficiency and performance in large-scale machine learning models.
Contribution
It proposes TailOPT with adaptive clipping techniques, providing convergence guarantees under heavy-tailed noise and introducing a memory-efficient variant, $Bi^2Clip$.
Findings
TailOPT outperforms existing methods on language tasks.
$Bi^2Clip$ achieves adaptive-like performance without extra gradient statistics.
The framework guarantees convergence under unbounded gradient variance.
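For readers unfamiliar with the phrase "unbounded gradient variance", one common way to formalize heavy-tailed gradient noise is a bounded moment of order $\alpha$ below 2. This is a standard condition from the heavy-tailed optimization literature, given here as an assumption for illustration; this summary does not state the paper's exact condition. The symbols $g(x)$ (stochastic gradient), $\sigma$, and $\alpha$ are introduced only for this sketch.

```latex
\mathbb{E}\left[\lVert g(x) - \nabla f(x) \rVert^{\alpha}\right] \le \sigma^{\alpha},
\qquad \alpha \in (1, 2]
```

For $\alpha = 2$ this reduces to the usual bounded-variance assumption; for $\alpha < 2$ the variance may be infinite while the mean is still well defined.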
Abstract
Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization or clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with potentially unbounded gradient variance and local updates. Among its variants, we highlight a memory- and communication-efficient instantiation which we call $Bi^2Clip$, which performs…
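The nested structure described in the abstract (local inner steps, then a global outer step) can be sketched in a few lines. The following is a minimal illustration only, not the paper's TailOPT or $Bi^2Clip$ algorithm: it assumes L2-norm clipping at both the inner and outer level, a toy quadratic objective, and symmetrized Pareto noise with infinite variance. All function names, parameters, and default values are hypothetical.

```python
import random


def clip_by_norm(vec, threshold):
    """Rescale vec so that its L2 norm is at most threshold."""
    norm = sum(v * v for v in vec) ** 0.5
    if norm <= threshold or norm == 0.0:
        return list(vec)
    scale = threshold / norm
    return [v * scale for v in vec]


def noisy_grad(x, rng):
    """Gradient of f(x) = 0.5 * ||x||^2 plus heavy-tailed noise.

    Assumption for illustration: symmetrized Pareto noise with shape 1.5,
    which has a finite mean but infinite variance.
    """
    return [
        xi + rng.choice((-1.0, 1.0)) * (rng.paretovariate(1.5) - 1.0)
        for xi in x
    ]


def train(num_rounds=200, num_workers=4, local_steps=5,
          inner_lr=0.05, outer_lr=1.0,
          inner_clip=1.0, outer_clip=1.0, seed=0):
    """Local-update training with clipping at the inner and outer steps."""
    rng = random.Random(seed)
    x = [5.0, -3.0]  # global model
    for _ in range(num_rounds):
        deltas = []
        for _ in range(num_workers):
            local = list(x)  # each worker starts from the global model
            for _ in range(local_steps):
                # Inner step: clip the stochastic gradient before applying it.
                g = clip_by_norm(noisy_grad(local, rng), inner_clip)
                local = [li - inner_lr * gi for li, gi in zip(local, g)]
            deltas.append([li - xi for li, xi in zip(local, x)])
        # Outer step: average the worker deltas, clip, then apply.
        avg = [sum(d[i] for d in deltas) / num_workers for i in range(len(x))]
        step = clip_by_norm(avg, outer_clip)
        x = [xi + outer_lr * si for xi, si in zip(x, step)]
    return x


if __name__ == "__main__":
    print(train())
```

Only the averaged model delta is communicated once per round, which is the communication saving that local updates are meant to provide. Clipping bounds the size of every inner and outer update, so a single extreme noise sample cannot move the model arbitrarily far.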
Taxonomy
Topics: Neural Networks and Applications · Energy Efficient Wireless Sensor Networks · Distributed Sensor Networks and Detection Algorithms
