Scaling Distributed Training with Adaptive Summation
Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum

TL;DR
This paper introduces Adasum, a gradient summation method that enables faster, more accurate distributed training of deep learning models by respecting the sequential nature of SGD, and demonstrates its scalability and convergence benefits.
Contribution
The paper presents Adasum, a novel gradient combination technique that improves convergence and scalability in distributed SGD, with formal justification and extensive empirical validation.
Findings
Adasum scales Momentum-SGD to 64K examples before communication.
Adasum allows Adam to scale to 64K examples, surpassing prior limits.
Adasum enables LAMB to scale to 128K examples, doubling previous scalability.
Abstract
Stochastic gradient descent (SGD) is an inherently sequential training algorithm: computing the gradient at batch i depends on the model parameters learned from batch i-1. Prior approaches that break this dependence do not honor them (e.g., sum the gradients for each batch, which is not what sequential SGD would do) and thus potentially suffer from poor convergence. This paper introduces a novel method to combine gradients called Adasum (for adaptive sum) that converges faster than prior work. Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod. This paper first provides a formal justification for Adasum and then empirically demonstrates Adasum is more accurate than prior gradient accumulation methods. It then introduces a series of case-studies to show Adasum works with multiple frameworks,…
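The abstract contrasts plain gradient summation with an adaptive sum but does not spell out the combination rule. The sketch below uses the pairwise rule as it is commonly stated for Adasum: each gradient is scaled down by half of its projection onto the other before the two are added. Neither the formula nor the helper names `dot` and `adasum` come from this summary, and they are not Horovod's API; treat the block as an illustration under that assumption, not as the authors' implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def adasum(g1, g2):
    """Pairwise adaptive sum of two gradient vectors (hypothetical helper).

    Assumed rule:
        (1 - g1.g2 / (2 * |g1|^2)) * g1 + (1 - g1.g2 / (2 * |g2|^2)) * g2
    """
    d = dot(g1, g2)
    n1 = dot(g1, g1)
    n2 = dot(g2, g2)
    # If a gradient is all zeros, its scaling factor is irrelevant; use 1.0
    # to avoid dividing by zero.
    c1 = 1.0 - d / (2.0 * n1) if n1 > 0 else 1.0
    c2 = 1.0 - d / (2.0 * n2) if n2 > 0 else 1.0
    return [c1 * x + c2 * y for x, y in zip(g1, g2)]


# Orthogonal gradients carry independent information: result equals the sum.
orthogonal = adasum([1.0, 0.0], [0.0, 1.0])

# Identical gradients carry redundant information: result equals the average.
parallel = adasum([1.0, 0.0], [1.0, 0.0])
```

Under this rule the operator interpolates between summing (orthogonal gradients) and averaging (parallel gradients), which is one way to read the abstract's claim that a plain sum is not what sequential SGD would do.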
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: LAMB · Stochastic Gradient Descent · Adam
