Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning

Luis Maßny; Christoph Hofmeister; Maximilian Egger; Rawad Bitar; Antonia Wachter-Zeh

arXiv:2212.08580·cs.IT·December 19, 2022

TL;DR

This paper introduces a flexible gradient coding scheme for distributed machine learning that adapts to the actual number of stragglers, reducing latency compared to fixed-straggler-tolerance codes.

Contribution

The paper proposes a gradient coding scheme that concatenates gradient codes designed for different straggler tolerances, so that the workers' computation load adapts to the actual number of stragglers.

Findings

01. Lower latency compared to traditional gradient codes

02. Adaptive scheme effectively handles variable straggler counts

03. Minimal additional signaling required for adaptation

Abstract

We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from only the non-straggling workers. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and can result in higher latency. We propose a gradient coding scheme that can tolerate a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. By proper task scheduling and small additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze…
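
The redundancy idea described in the abstract can be illustrated with a small fixed-tolerance example. The sketch below is not the nested scheme proposed in the paper. It is a common three-worker example from the gradient coding literature that tolerates exactly one straggler. The encoding matrix `B`, the lookup table `DECODE`, and the helper names `encode` and `decode` are assumptions chosen here for illustration.

```python
# Illustrative sketch only: a fixed-tolerance gradient code for 3 workers
# and 1 straggler. This is NOT the nested scheme from the paper.

# Row i holds the weights worker i applies to the partial gradients g0, g1, g2.
# A zero weight means the worker does not compute that partial gradient.
B = [
    [0.5, 1.0, 0.0],   # worker 0 sends g0/2 + g1
    [0.0, 1.0, -1.0],  # worker 1 sends g1 - g2
    [0.5, 0.0, 1.0],   # worker 2 sends g0/2 + g2
]

# Decoding coefficients for every pair of responding workers.
# Each pair's weighted sum of coded messages equals g0 + g1 + g2.
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}


def encode(worker, partial_grads):
    """Return the coded message of one worker as a list of floats."""
    dim = len(partial_grads[0])
    return [
        sum(B[worker][j] * partial_grads[j][d] for j in range(3))
        for d in range(dim)
    ]


def decode(responses):
    """Recover the full gradient from the messages of any two workers.

    responses maps a worker index to that worker's coded message.
    """
    workers = tuple(sorted(responses))
    coeffs = DECODE[workers]
    dim = len(responses[workers[0]])
    return [
        sum(c * responses[w][d] for c, w in zip(coeffs, workers))
        for d in range(dim)
    ]


if __name__ == "__main__":
    grads = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
    # Suppose worker 1 straggles; workers 0 and 2 still suffice.
    received = {0: encode(0, grads), 2: encode(2, grads)}
    print(decode(received))
```

In this code every worker computes two of the three partial gradients, even in rounds where no worker straggles. That fixed computation load is what the abstract identifies as potentially sub-optimal when the number of stragglers varies.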

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Wireless Communication Security Techniques