Gap Aware Mitigation of Gradient Staleness
Saar Barkai, Ido Hakimi, Assaf Schuster

TL;DR
This paper introduces Gap-Aware (GA), a novel asynchronous training method that mitigates gradient staleness in distributed deep learning, improving accuracy and convergence even with many workers in cloud environments.
Contribution
The paper proposes a new staleness measure called Gap and a corresponding penalization method, GA, which enhances asynchronous SGD performance at scale.
Findings
GA outperforms existing gradient penalization methods in accuracy.
GA maintains convergence and benefits from momentum in large-scale asynchronous training.
A theoretical convergence rate for GA is established.
Abstract
Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is non-dedicated, as is common in cloud computing. Asynchronous SGD (ASGD) methods are immune to these slowdowns but are scarcely used due to gradient staleness, which encumbers the convergence process. Recent techniques have had limited success mitigating the gradient staleness when scaling up to many workers (computing nodes). In this paper we define the Gap as a measure of gradient staleness and propose Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients linearly to the Gap and performs well even when scaling to large numbers of workers. Our evaluation on the CIFAR, ImageNet, and WikiText-103 datasets shows that GA…
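To make the idea of penalizing a stale gradient "linearly to the Gap" concrete, here is a minimal sketch of a parameter-server update. The abstract does not give the exact definition of the Gap, so everything specific here is an assumption for illustration: the Gap is taken to be the Euclidean distance between the master's current parameters and the copy the worker used, measured in units of a typical update norm (`avg_step_norm`, a hypothetical value), plus 1 so that a fresh gradient is left unchanged. The function names are hypothetical as well.

```python
import math


def gap(master_params, worker_params, avg_step_norm):
    """Illustrative Gap (an assumption, not the paper's exact formula).

    Distance between the master's parameters and the stale copy the worker
    computed its gradient on, in units of a typical update norm, plus 1.
    """
    if avg_step_norm <= 0:
        raise ValueError("avg_step_norm must be positive")
    dist = math.sqrt(
        sum((m - w) ** 2 for m, w in zip(master_params, worker_params))
    )
    return dist / avg_step_norm + 1.0


def gap_aware_update(master_params, worker_params, grad, lr, avg_step_norm):
    """Apply one SGD step with the gradient divided by the Gap."""
    g = gap(master_params, worker_params, avg_step_norm)
    # A larger Gap (a staler gradient) shrinks the step proportionally.
    return [m - lr * gi / g for m, gi in zip(master_params, grad)]


# Fresh gradient: the worker's copy matches the master, so the Gap is 1
# and the update is an ordinary SGD step.
fresh = gap_aware_update([1.0, 2.0], [1.0, 2.0], [1.0, 1.0], 0.1, 5.0)

# Stale gradient: the master has moved a distance of 5 from the worker's
# copy, which is one typical step here, so the Gap is 2 and the step halves.
stale = gap_aware_update([3.0, 4.0], [0.0, 0.0], [1.0, 1.0], 0.1, 5.0)
```

Under these assumptions the penalty depends on how far the parameters have actually moved rather than on how many updates have elapsed, which is what the summary above describes as measuring staleness by the Gap.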
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
Methods: Genetic Algorithms · Stochastic Gradient Descent
