Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
Jianyu Wang, Gauri Joshi

TL;DR
This paper introduces AdaComm, an adaptive communication strategy for distributed SGD that balances error convergence and runtime by adjusting communication frequency, leading to faster training without sacrificing accuracy.
Contribution
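The paper proposes AdaComm, an adaptive communication strategy for local-update SGD that starts with infrequent model averaging to save communication delay and then averages more often to reach a low error floor, with convergence measured against wall-clock time rather than iteration count.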
Findings
AdaComm can take 3x less wall-clock time than fully synchronous SGD.
AdaComm achieves the same final training loss with less communication overhead.
The strategy effectively balances error convergence and runtime in distributed neural network training.
Abstract
Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations), and analyze how it is affected by the frequency of averaging. The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that AdaComm can take…
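The training loop described in the abstract (local updates, periodic averaging, and a communication period that shrinks over time) can be illustrated with a minimal simulation. This is a sketch under stated assumptions, not the authors' implementation: the objective f(x) = 0.5 * x^2, the function name local_update_sgd, the parameters compute_time and comm_delay, and in particular the rule of halving the period tau every halve_every steps are all hypothetical choices made for illustration. The paper's own rule for adapting the communication period is not reproduced here.

```python
import random


def local_update_sgd(num_workers=4, total_steps=400, tau0=16, halve_every=100,
                     lr=0.05, noise_std=1.0, compute_time=1.0, comm_delay=4.0,
                     seed=0):
    """Simulate local-update SGD on f(x) = 0.5 * x^2 with noisy gradients.

    Returns (averaged_model, simulated_wall_clock_time, averaging_rounds).
    All timing values are made-up units for illustration only.
    """
    rng = random.Random(seed)
    models = [10.0] * num_workers  # every worker starts from the same point
    tau = tau0                     # local steps between two averaging rounds
    steps_since_avg = 0
    wall_clock = 0.0
    rounds = 0

    for step in range(1, total_steps + 1):
        # Each worker takes one SGD step on its own local copy of the model.
        # The gradient of 0.5 * x^2 is x; Gaussian noise mimics mini-batch noise.
        models = [x - lr * (x + rng.gauss(0.0, noise_std)) for x in models]
        wall_clock += compute_time  # workers compute in parallel
        steps_since_avg += 1

        # Periodic averaging: synchronise all workers and pay the delay.
        if steps_since_avg >= tau:
            average = sum(models) / num_workers
            models = [average] * num_workers
            wall_clock += comm_delay
            rounds += 1
            steps_since_avg = 0

        # Hypothetical schedule (an assumption, not the paper's rule):
        # halve tau at fixed intervals so that averaging becomes more frequent.
        if step % halve_every == 0:
            tau = max(1, tau // 2)

    return sum(models) / num_workers, wall_clock, rounds


if __name__ == "__main__":
    adaptive = local_update_sgd(tau0=16)
    synchronous = local_update_sgd(tau0=1)  # averaging after every step
    print("adaptive    (model, time, rounds):", adaptive)
    print("synchronous (model, time, rounds):", synchronous)
```

Setting tau0=1 gives the fully synchronous baseline, which averages after every step. In this toy model the adaptive run performs fewer averaging rounds and therefore accumulates less simulated communication delay for the same number of local steps; whether the final error matches depends on the noise level and the schedule chosen.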
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent
