Leveraging partial stragglers within gradient coding
Aditya Ramamoorthy, Ruoyu Meng, Vrinda S. Girimaji

TL;DR
This paper introduces new gradient coding protocols that use the partial work completed by slow (straggling) workers in distributed learning, making exact gradient reconstruction roughly 2x faster and approximate reconstruction markedly more accurate than existing methods.
Contribution
The authors develop novel gradient coding protocols that use the work completed by partial stragglers, improving computational efficiency and stability over traditional approaches that classify each worker as either operational or failed.
Findings
Approximately 2x faster exact gradient reconstruction.
Significantly lower mean-squared-error in approximate gradients.
Protocols are computationally and communication efficient.
Abstract
Within distributed learning, workers typically compute gradients on their assigned dataset chunks and send them to the parameter server (PS), which aggregates them to compute either an exact or approximate version of $\nabla L$ (gradient of the loss function $L$). However, in large-scale clusters, many workers are slower than their promised speed or even failure-prone. A gradient coding solution introduces redundancy within the assignment of chunks to the workers and uses coding theoretic ideas to allow the PS to recover $\nabla L$ (exactly or approximately), even in the presence of stragglers. Unfortunately, most existing gradient coding protocols are inefficient from a computation perspective as they coarsely classify workers as operational or failed; the potentially valuable work performed by slow workers (partial stragglers) is ignored. In this work, we present novel gradient coding…
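To make the idea of redundancy plus decoding concrete, here is a minimal sketch of the well-known three-worker, three-chunk example from the earlier gradient coding literature. It is not the protocol proposed in this paper. It tolerates one full straggler and, like the classical schemes the abstract criticizes, treats a worker as either finished or failed. The names `ENCODE`, `DECODE`, `worker_message`, and `recover_gradient` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Row w holds the weights worker w applies to the chunk gradients g0, g1, g2.
# A zero weight means the chunk is not assigned to that worker, so each
# worker stores two of the three chunks (this is the redundancy).
ENCODE = [
    [Fraction(1, 2), Fraction(1), Fraction(0)],   # worker 0: g0/2 + g1
    [Fraction(0), Fraction(1), Fraction(-1)],     # worker 1: g1 - g2
    [Fraction(1, 2), Fraction(0), Fraction(1)],   # worker 2: g0/2 + g2
]

# For every pair of finished workers, weights that combine their two
# messages into g0 + g1 + g2.
DECODE = {
    (0, 1): {0: Fraction(2), 1: Fraction(-1)},
    (0, 2): {0: Fraction(1), 2: Fraction(1)},
    (1, 2): {1: Fraction(1), 2: Fraction(2)},
}


def worker_message(worker, chunk_grads):
    """Return the single coded vector that `worker` sends to the PS."""
    dim = len(chunk_grads[0])
    return [
        sum(ENCODE[worker][c] * chunk_grads[c][k] for c in range(3))
        for k in range(dim)
    ]


def recover_gradient(messages):
    """Recover the full gradient from a dict {worker: coded vector}.

    Any two finished workers suffice; a worker that has not finished
    contributes nothing, however much of its work it completed.
    """
    if len(messages) < 2:
        raise ValueError("need messages from at least two workers")
    survivors = tuple(sorted(messages))[:2]
    weights = DECODE[survivors]
    dim = len(messages[survivors[0]])
    return [
        sum(weights[w] * messages[w][k] for w in survivors)
        for k in range(dim)
    ]
```

For example, with chunk gradients `[[1, 2], [3, 4], [5, 6]]`, passing the messages of any two workers to `recover_gradient` returns the full sum. A worker that finished only one of its two chunks would count as failed in this sketch, which is the inefficiency that protocols using partial stragglers aim to remove.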
Peer Reviews
Decision: NeurIPS 2024 poster
- The authors study a problem of interest to the ML community.
- The paper is well structured and clearly explains the problem and the proposed solutions. It is easy to read and its main points are easy to follow.
- The paper introduces innovative gradient coding protocols that leverage the partial contributions of slow workers, addressing a gap in existing methods that typically ignore these partial stragglers. Additionally, the approach of optimizing the order of data chunks within workers is a
- Although the paper introduces a novel approach, part of this work (as the authors also mention) builds heavily on existing works (the long literature on GC as well as [34, 35, 37, 38]), which somewhat limits its novelty.
- The focus of this work is mostly theoretical without providing extensive experiments. Importantly, actual cloud platform statistics on the communication time reduction achieved by the proposed method are missing. This potentially limits the practicality of t
Taxonomy
Topics: Image Processing Techniques and Applications · Cell Image Analysis Techniques · Optical measurement and interference techniques
