Gradient Coding
Rashish Tandon, Qi Lei, Alexandros G. Dimakis, Nikos Karampatziakis

TL;DR
This paper introduces a coding-theoretic approach that reduces delays caused by slow or failing nodes (stragglers) in distributed learning, improving efficiency and robustness.
Contribution
It presents a novel framework that uses data replication and gradient coding to tolerate stragglers in synchronous gradient descent, with practical implementation and evaluation.
Findings
Improved running time compared to baseline methods
Enhanced robustness to stragglers in distributed training
Comparable or better generalization error
Abstract
We propose a novel coding-theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous Gradient Descent. We implement our schemes in Python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.
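To make "replicating data blocks and coding across gradients" concrete, here is a minimal toy sketch, assuming three workers, three data partitions, and tolerance to one straggler. The coefficient matrix `B`, the `DECODE` table, and the helper names `encode` and `recover` are illustrative assumptions, not the authors' implementation. Each worker holds two of the three partitions and sends one linear combination of its partial gradients; any two messages are enough to rebuild the full gradient sum.

```python
from itertools import combinations

# Row i: coefficients worker i applies to the partial gradients (g0, g1, g2).
# A zero means worker i does not hold that data partition.
B = [
    [0.5, 1.0, 0.0],   # worker 0 holds partitions 0 and 1
    [0.0, 1.0, -1.0],  # worker 1 holds partitions 1 and 2
    [0.5, 0.0, 1.0],   # worker 2 holds partitions 0 and 2
]

# Decoding weights for each pair of surviving workers, chosen so that the
# weighted sum of their rows of B equals the all-ones vector (1, 1, 1).
DECODE = {
    (0, 1): (2.0, -1.0),
    (0, 2): (1.0, 1.0),
    (1, 2): (1.0, 2.0),
}

def encode(worker, partial_grads):
    """Return the single coded message a worker sends to the master."""
    dim = len(partial_grads[0])
    return [
        sum(c * partial_grads[j][d] for j, c in enumerate(B[worker]) if c != 0.0)
        for d in range(dim)
    ]

def recover(survivors, messages):
    """Rebuild g0 + g1 + g2 from the messages of any two workers."""
    order = tuple(sorted(survivors))
    weights = DECODE[order]
    dim = len(messages[order[0]])
    return [
        sum(w * messages[s][d] for w, s in zip(weights, order))
        for d in range(dim)
    ]

# Hypothetical partial gradients, one per data partition (dimension 2).
partial_grads = [[1.0, -2.0], [0.5, 4.0], [3.0, 0.25]]
messages = [encode(w, partial_grads) for w in range(3)]

# Whichever single worker straggles, the other two suffice.
for survivors in combinations(range(3), 2):
    print(survivors, recover(survivors, messages))
```

The cost of this tolerance is replication: every worker computes gradients on two partitions instead of one, so the master can ignore the slowest worker in each iteration.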
Taxonomy
Topics: Advanced Data Compression Techniques
