Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD

Jianyu Wang; Gauri Joshi

arXiv:1810.08313·cs.LG·March 8, 2019·36 cites

TL;DR

This paper introduces AdaComm, an adaptive communication strategy for distributed SGD that balances error convergence and runtime by adjusting communication frequency, leading to faster training without sacrificing accuracy.

Contribution

The paper proposes AdaComm, an adaptive communication method for local-update SGD that begins with infrequent model averaging to cut communication delay, then averages more often to reach a low error floor.

Findings

01. AdaComm can take about 3x less wall-clock time than fully synchronous SGD.

02. AdaComm reaches the same final training loss with less communication overhead.

03. The strategy balances error convergence against runtime when training neural networks in a distributed setting.

Abstract

Large-scale machine learning training, in particular distributed stochastic gradient descent, needs to be robust to inherent system variability such as node straggling and random communication delays. This work considers a distributed training framework where each worker node is allowed to perform local model updates and the resulting models are averaged periodically. We analyze the true speed of error convergence with respect to wall-clock time (instead of the number of iterations), and analyze how it is affected by the frequency of averaging. The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor. Rigorous experiments on training deep neural networks show that AdaComm can take…
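The training loop the abstract describes (local updates on each worker, periodic model averaging, and an averaging period that shrinks over time) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy objective f(x) = 0.5 * x^2, the function names `local_update_sgd` and `decreasing_tau`, and the halving schedule are all hypothetical choices made for the example, and the rule AdaComm actually uses to pick the averaging period is not reproduced here.

```python
import random


def local_update_sgd(num_workers=4, total_steps=200, tau_schedule=None,
                     lr=0.05, noise=0.5, seed=0):
    """Periodic-averaging SGD on the toy objective f(x) = 0.5 * x**2.

    tau_schedule(step) gives the number of local steps to run before the
    next averaging round. Returns (final averaged model, averaging rounds).
    """
    rng = random.Random(seed)
    if tau_schedule is None:
        # Averaging after every step corresponds to fully synchronous SGD.
        tau_schedule = lambda step: 1
    models = [10.0] * num_workers  # every worker starts from the same point
    step = 0
    rounds = 0
    while step < total_steps:
        # Clamp tau so the loop never runs past total_steps.
        tau = max(1, min(tau_schedule(step), total_steps - step))
        for _ in range(tau):
            # Each worker takes a noisy gradient step on its own model copy.
            models = [x - lr * (x + rng.gauss(0.0, noise)) for x in models]
        step += tau
        # Communication round: replace all local models with their average.
        avg = sum(models) / num_workers
        models = [avg] * num_workers
        rounds += 1
    return models[0], rounds


def decreasing_tau(step, tau0=16, interval=50):
    """Hypothetical schedule (an assumption, not the paper's rule):
    halve the averaging period every `interval` steps, never below 1."""
    return max(1, tau0 >> (step // interval))


if __name__ == "__main__":
    _, sync_rounds = local_update_sgd()
    _, adaptive_rounds = local_update_sgd(tau_schedule=decreasing_tau)
    print("averaging rounds, fully synchronous:", sync_rounds)
    print("averaging rounds, decreasing tau:", adaptive_rounds)
```

The sketch only shows the mechanics and the drop in the number of communication rounds. On a quadratic like this one the averaged model follows the same trajectory for any averaging period, so it does not exhibit the error-floor effect the paper analyzes; a non-convex model and measured communication delays would be needed for that.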

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Machine Learning and ELM

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Stochastic Gradient Descent