Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Sanghamitra Dutta; Gauri Joshi; Soumyadip Ghosh; Parijat Dube; Priya Nagpurkar

arXiv:1803.01113 · stat.ML · May 11, 2018 · 27 cites

TL;DR

This paper analyzes the error-runtime trade-offs in distributed SGD. It introduces a theoretical framework that accounts for random straggler delays and uses it to design and compare algorithms that balance the cost of waiting for stragglers against the cost of gradient staleness.

Contribution

The paper provides a runtime analysis of distributed SGD under random straggler delays, a convergence analysis of asynchronous SGD variants that does not assume bounded or exponential delays, and a new learning rate schedule that compensates for gradient staleness.

Findings

1. Asynchronous SGD can significantly reduce training time despite gradient staleness.

2. The proposed analysis offers insights into balancing stragglers and staleness for optimal performance.

3. A new learning rate schedule improves convergence in asynchronous settings.

Abstract

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present a novel theoretical characterization of the speed-up offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty in our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between stragglers and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions, and a novel learning rate schedule to compensate for gradient staleness.
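
The straggler side of this trade-off can be illustrated with a small simulation. This is a minimal sketch, not the paper's analysis: it assumes each learner's gradient computation time is i.i.d. exponential with rate `rate` (an illustrative choice; the abstract only says the delays are random), and the function names `harmonic` and `simulate_time_per_update` are hypothetical. Under that assumption, a fully synchronous update waits for the slowest of n learners, with expected time H_n / rate, where H_n is the n-th harmonic number. A fully asynchronous update waits only for the next learner to finish, with expected time 1 / (n * rate).

```python
import random


def harmonic(n):
    """Return the n-th harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))


def simulate_time_per_update(num_learners, rate, num_rounds, seed=0):
    """Estimate the mean wallclock time per model update.

    Assumption (not from the paper): each learner's compute time is an
    independent Exponential(rate) random variable.

    Returns a pair (sync_mean, async_mean):
      sync_mean  - mean time per update when the server waits for all learners
      async_mean - mean time per update when the server applies each gradient
                   as soon as any learner finishes
    """
    rng = random.Random(seed)
    sync_total = 0.0
    async_total = 0.0
    for _ in range(num_rounds):
        delays = [rng.expovariate(rate) for _ in range(num_learners)]
        # Synchronous SGD: the update is gated by the slowest learner.
        sync_total += max(delays)
        # Asynchronous SGD: the next update arrives when the first learner
        # finishes. Because exponential delays are memoryless, redrawing all
        # delays each round gives the same inter-update time distribution.
        async_total += min(delays)
    return sync_total / num_rounds, async_total / num_rounds


if __name__ == "__main__":
    n, mu = 8, 1.0
    sync_mean, async_mean = simulate_time_per_update(n, mu, num_rounds=20000)
    print("synchronous  (simulated, theory):", sync_mean, harmonic(n) / mu)
    print("asynchronous (simulated, theory):", async_mean, 1.0 / (n * mu))
```

The sketch covers runtime per update only. It does not model how stale gradients change the error reduction achieved per update, which is the other half of the trade-off described in the abstract.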

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Memory and Neural Computing

Methods: Stochastic Gradient Descent