Sequential Gradient Coding For Straggler Mitigation
M. Nikhil Krishnan, MohammadReza Ebrahimi, Ashish Khisti

TL;DR
This paper introduces two gradient coding schemes that exploit the temporal dimension and selective repetition of unfinished tasks to mitigate stragglers in distributed neural network training, reducing runtime by up to 16% relative to baseline GC.
Contribution
The main contribution is a novel gradient coding scheme that combines coding and repetition, exploiting temporal dynamics for better straggler mitigation in distributed computing.
Findings
Achieved up to 16% reduction in runtime over baseline GC.
Demonstrated effectiveness in a practical AWS Lambda cluster setting.
Improved straggler mitigation through adaptive task multiplexing.
Abstract
In distributed computing, slower nodes (stragglers) usually become a bottleneck. Gradient Coding (GC), introduced by Tandon et al., is an efficient technique that uses principles of error-correcting codes to distribute gradient computation in the presence of stragglers. In this paper, we consider the distributed computation of a sequence of gradients $\{g(1), g(2), \ldots, g(J)\}$, where processing of each gradient $g(t)$ starts in round-$t$ and finishes by round-$(t+T)$. Here $T \geq 0$ denotes a delay parameter. For the GC scheme, coding is only across computing nodes and this results in a solution where $T = 0$. On the other hand, having $T > 0$ allows for designing schemes which exploit the temporal dimension as well. In this work, we propose two schemes that demonstrate improved performance compared to GC. Our first scheme combines GC with selective repetition of previously unfinished…
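To make the baseline concrete, here is a minimal sketch of single-round gradient coding using the fractional repetition construction described by Tandon et al. It codes only across workers within one round and is not an implementation of the two schemes proposed in this paper. The function names (`assign_partitions`, `add_vectors`, `decode`) and the toy integer "gradients" are hypothetical, chosen for illustration, and the sketch assumes the number of workers is divisible by the number of tolerated stragglers plus one.

```python
import random


def add_vectors(vectors):
    """Element-wise sum of a list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def assign_partitions(n_workers, n_stragglers):
    """Fractional repetition: split workers into (s + 1) groups; inside each
    group the n data partitions are divided disjointly among the workers."""
    if n_workers % (n_stragglers + 1) != 0:
        raise ValueError("n_workers must be divisible by n_stragglers + 1")
    workers_per_group = n_workers // (n_stragglers + 1)
    parts_per_worker = n_stragglers + 1
    assignment = []
    for worker in range(n_workers):
        pos = worker % workers_per_group  # position inside its group
        start = pos * parts_per_worker
        assignment.append(list(range(start, start + parts_per_worker)))
    return assignment


def decode(received, n_workers, n_stragglers):
    """Recover the full gradient from the responding workers' messages.
    With at most s stragglers and s + 1 groups, some group is complete."""
    workers_per_group = n_workers // (n_stragglers + 1)
    for group in range(n_stragglers + 1):
        first = group * workers_per_group
        members = range(first, first + workers_per_group)
        if all(w in received for w in members):
            return add_vectors([received[w] for w in members])
    raise RuntimeError("too many stragglers to decode")


# Toy setup (assumed values): 6 workers, tolerate 2 stragglers.
n, s, dim = 6, 2, 4
rng = random.Random(0)
partial_grads = [[rng.randint(-5, 5) for _ in range(dim)] for _ in range(n)]
full_gradient = add_vectors(partial_grads)

assignment = assign_partitions(n, s)
# Each worker sends the sum of the partial gradients it was assigned.
coded = {w: add_vectors([partial_grads[p] for p in assignment[w]])
         for w in range(n)}

stragglers = {0, 3}  # these workers never respond
received = {w: msg for w, msg in coded.items() if w not in stragglers}
decoded = decode(received, n, s)
```

Each partition is replicated on `s + 1` workers, so the master tolerates any `s` stragglers at the cost of an `(s + 1)`-fold per-worker computation load. That load is the overhead the abstract's sequential schemes aim to use more efficiently by also exploiting the delay parameter.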
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Ferroelectric and Negative Capacitance Devices
