Gradient Coding with Dynamic Clustering for Straggler Mitigation

Baturalp Buyukates, Emre Ozfatura, Sennur Ulukus, Deniz Gunduz

arXiv:2011.01922 · cs.IT · November 4, 2020

TL;DR

This paper introduces GC-DC, a gradient coding scheme with dynamic clustering that mitigates stragglers in distributed synchronous gradient descent. Compared to the original gradient coding scheme, it reduces the average per-iteration completion time with no increase in communication load.

Contribution

The paper proposes a novel gradient coding scheme with dynamic clustering. Under time-correlated straggling, it uses the straggler behavior observed in the previous iteration to regulate the number of straggling workers in each cluster.

Findings

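01

GC-DC significantly improves the average per-iteration completion time over the original GC scheme, as shown numerically.

02

GC-DC does not increase the communication load relative to the original GC scheme.

03

GC-DC targets time-correlated straggling, using the previous iteration's straggler behavior to regulate stragglers per cluster.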

Abstract

In distributed synchronous gradient descent (GD), the main performance bottleneck for the per-iteration completion time is the slowest, straggling workers. To speed up GD iterations in the presence of stragglers, coded distributed computation techniques are implemented by assigning redundant computations to workers. In this paper, we propose a novel gradient coding (GC) scheme that utilizes dynamic clustering, denoted by GC-DC, to speed up the gradient calculation. Under time-correlated straggling behavior, GC-DC aims at regulating the number of straggling workers in each cluster based on the straggler behavior in the previous iteration. We numerically show that GC-DC provides significant improvements in the average completion time (of each iteration) with no increase in the communication load compared to the original GC scheme.
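
The idea of regulating the number of straggling workers per cluster can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified illustration that assumes any worker may join any cluster, whereas a real scheme must respect how data is placed on workers. The names `balance_stragglers` and `stragglers_per_cluster` are hypothetical.

```python
from typing import List


def balance_stragglers(was_straggler: List[bool], num_clusters: int) -> List[List[int]]:
    """Group workers into equal-size clusters, spreading the previous
    iteration's stragglers as evenly as possible.

    Assumption (not from the paper): any worker can be placed in any cluster.
    """
    n = len(was_straggler)
    if num_clusters <= 0 or n % num_clusters != 0:
        raise ValueError("number of workers must be divisible by num_clusters")
    size = n // num_clusters

    stragglers = [w for w in range(n) if was_straggler[w]]
    fast = [w for w in range(n) if not was_straggler[w]]

    clusters: List[List[int]] = [[] for _ in range(num_clusters)]
    # Deal stragglers round-robin so per-cluster counts differ by at most one.
    for i, w in enumerate(stragglers):
        clusters[i % num_clusters].append(w)
    # Fill the remaining slots of each cluster with non-straggling workers.
    fast_iter = iter(fast)
    for cluster in clusters:
        while len(cluster) < size:
            cluster.append(next(fast_iter))
    return clusters


def stragglers_per_cluster(clusters: List[List[int]], was_straggler: List[bool]) -> List[int]:
    """Count how many workers in each cluster straggled in the previous iteration."""
    return [sum(1 for w in cluster if was_straggler[w]) for cluster in clusters]


# Workers 0 and 1 straggled in the previous iteration.
flags = [True, True, False, False, False, False, False, False]
static_clusters = [[0, 1, 2, 3], [4, 5, 6, 7]]   # fixed clustering
dynamic_clusters = balance_stragglers(flags, 2)  # re-clustered from last iteration

print(stragglers_per_cluster(static_clusters, flags))   # [2, 0]
print(stragglers_per_cluster(dynamic_clusters, flags))  # [1, 1]
```

With the fixed clustering, both stragglers land in one cluster. If straggling is time-correlated, re-clustering on the previous iteration's observations keeps the expected number of stragglers in any single cluster lower, which is the effect the abstract attributes to GC-DC.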
