Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning
Luis Maßny, Christoph Hofmeister, Maximilian Egger, Rawad Bitar, Antonia Wachter-Zeh

TL;DR
This paper introduces a flexible gradient coding scheme for distributed machine learning that adapts to the actual number of stragglers, reducing latency compared to fixed-straggler-tolerance codes.
Contribution
The paper proposes a gradient coding scheme that concatenates gradient codes with different straggler tolerances, so that the workers' computation load adapts to the actual number of stragglers rather than to a fixed design value.
Findings
Lower latency compared to traditional gradient codes
Adaptive scheme effectively handles variable straggler counts
Minimal additional signaling required for adaptation
Abstract
We consider distributed learning in the presence of slow and unresponsive worker nodes, referred to as stragglers. In order to mitigate the effect of stragglers, gradient coding redundantly assigns partial computations to the workers such that the overall result can be recovered from only the non-straggling workers. Gradient codes are designed to tolerate a fixed number of stragglers. Since the number of stragglers in practice is random and unknown a priori, tolerating a fixed number of stragglers can yield a sub-optimal computation load and can result in higher latency. We propose a gradient coding scheme that can tolerate a flexible number of stragglers by carefully concatenating gradient codes for different straggler tolerances. By proper task scheduling and small additional signaling, our scheme adapts the computation load of the workers to the actual number of stragglers. We analyze…
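Example
To make the baseline concrete, here is a minimal Python sketch of a conventional gradient code with a fixed straggler tolerance s, the kind of code the abstract describes as the building block. It uses the well-known fractional repetition construction, not the nested scheme proposed in the paper. The function names (assign_partitions, worker_message, decode) and the assumption that s + 1 divides the number of workers n are illustrative choices made here, not taken from the paper.

```python
import itertools


def assign_partitions(n, s):
    """Fractional repetition: worker w gets the s + 1 partitions of its group.

    Assumption of this sketch: (s + 1) divides n, and there are n partitions.
    """
    if n % (s + 1) != 0:
        raise ValueError("this sketch assumes (s + 1) divides n")
    size = s + 1
    return [list(range((w // size) * size, (w // size + 1) * size))
            for w in range(n)]


def worker_message(partial_grads, partitions):
    """A worker sends the sum of the partial gradients assigned to it."""
    dim = len(partial_grads[0])
    return [sum(partial_grads[p][k] for p in partitions) for k in range(dim)]


def decode(messages, n, s):
    """Recover the full gradient from the non-stragglers' messages.

    messages maps worker index -> received vector. One responder per group
    is enough, because all workers in a group send the same sum.
    """
    size = s + 1
    dim = len(next(iter(messages.values())))
    total = [0.0] * dim
    for g in range(n // size):
        responders = [w for w in range(g * size, (g + 1) * size)
                      if w in messages]
        if not responders:
            raise RuntimeError("a whole group straggled: more than s stragglers")
        total = [t + m for t, m in zip(total, messages[responders[0]])]
    return total


# Toy run: 6 workers, tolerance s = 2, two-dimensional partial gradients.
n, s = 6, 2
grads = [[float(i), float(2 * i + 1)] for i in range(n)]
full_gradient = [sum(g[k] for g in grads) for k in range(2)]
assignment = assign_partitions(n, s)

# Every pattern of exactly s stragglers still allows exact recovery.
for stragglers in itertools.combinations(range(n), s):
    msgs = {w: worker_message(grads, assignment[w])
            for w in range(n) if w not in stragglers}
    assert decode(msgs, n, s) == full_gradient
```

In this sketch every worker always computes s + 1 partial gradients, even in rounds where fewer than s workers straggle. That fixed computation load is the inefficiency the abstract attributes to codes designed for a fixed number of stragglers.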
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Wireless Communication Security Techniques
