Asynchronous Heavy-Tailed Optimization
Junfei Sun, Dixi Yao, Xuchen Gong, Tahseen Rabbani, Manzil Zaheer, Tian Li

TL;DR
This paper investigates asynchronous optimization algorithms in the presence of heavy-tailed gradient noise, proposing modifications that improve convergence, delay tolerance, and robustness in training transformer models.
Contribution
It introduces delay-aware learning rate scheduling and delay compensation techniques with theoretical guarantees and empirical improvements over existing methods.
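The two ideas named above can be illustrated with a minimal sketch. This is not the paper's algorithm: the summary gives no formulas, so the `1 / (1 + delay)` step-size rule, the DC-ASGD-style first-order compensation term, and every function and parameter name below are assumptions chosen for illustration only.

```python
from typing import List


def delay_aware_lr(base_lr: float, delay: int) -> float:
    """Shrink the step size as the gradient gets staler.

    Assumption: a simple base_lr / (1 + delay) rule, not the paper's schedule.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    return base_lr / (1.0 + delay)


def compensate_gradient(
    stale_grad: List[float],
    stale_params: List[float],
    current_params: List[float],
    lam: float = 0.5,
) -> List[float]:
    """Correct a stale gradient toward the current parameters.

    Assumption: DC-ASGD-style correction g + lam * g * g * (w_now - w_stale),
    applied elementwise; the paper's compensation may differ.
    """
    return [
        g + lam * g * g * (w_now - w_old)
        for g, w_old, w_now in zip(stale_grad, stale_params, current_params)
    ]


def async_step(
    current_params: List[float],
    stale_grad: List[float],
    stale_params: List[float],
    delay: int,
    base_lr: float = 0.1,
    lam: float = 0.5,
) -> List[float]:
    """Apply one server-side update using a gradient that is `delay` steps old."""
    lr = delay_aware_lr(base_lr, delay)
    grad = compensate_gradient(stale_grad, stale_params, current_params, lam)
    return [w - lr * g for w, g in zip(current_params, grad)]
```

In this sketch, a worker that computed its gradient at `stale_params` and reports it `delay` server steps later has that gradient both corrected and down-weighted before it is applied, so very stale updates move the model less.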
Findings
Convergence rates under heavy-tailed noise match those of the synchronous counterparts.
Enhanced delay tolerance compared to prior asynchronous approaches.
Better accuracy and runtime trade-offs in image and language tasks.
Abstract
Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to address heavy-tailed noise in the centralized or distributed, synchronous setting, leaving the interactions between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to enhance the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rate of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
