Gradient Coding

Rashish Tandon; Qi Lei; Alexandros G. Dimakis; Nikos Karampatziakis

arXiv:1612.03301 · stat.ML · March 9, 2017 · 1 cites



TL;DR

This paper introduces a coding-theoretic approach that reduces the delays caused by slow or failing nodes (stragglers) in distributed learning.

Contribution

It presents a framework that combines data replication with coding across gradients to tolerate stragglers in synchronous gradient descent. The schemes are implemented in Python (using MPI) and evaluated on Amazon EC2.

Findings

01. Improved running time compared to baseline methods

02. Enhanced robustness to stragglers in distributed training

03. Comparable or better generalization error

Abstract

We propose a novel coding-theoretic framework for mitigating stragglers in distributed learning. We show how carefully replicating data blocks and coding across gradients can provide tolerance to failures and stragglers for Synchronous Gradient Descent. We implement our schemes in Python (using MPI) to run on Amazon EC2, and show how we compare against baseline approaches in running time and generalization error.
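To make the idea of "replicating data blocks and coding across gradients" concrete, here is a minimal sketch for 3 workers tolerating 1 straggler. It is an illustration under stated assumptions, not the authors' implementation: the encoding matrix `B`, the lookup table `DECODE`, and the helper names `encode` and `decode` are all hypothetical choices made for this example, and it runs in a single process rather than over MPI. Each worker holds 2 of the 3 data partitions and sends one linear combination of its partial gradients; the full gradient sum can then be rebuilt from any 2 of the 3 messages.

```python
from fractions import Fraction
from itertools import combinations

HALF = Fraction(1, 2)

# Hypothetical encoding matrix: row i holds the coefficients worker i applies
# to the partial gradients (g1, g2, g3). A zero means the worker does not
# hold that partition, so each worker stores only 2 of the 3 partitions.
B = [
    [HALF, 1, 0],   # worker 0 sends g1/2 + g2
    [0, 1, -1],     # worker 1 sends g2 - g3
    [HALF, 0, 1],   # worker 2 sends g1/2 + g3
]

# Hypothetical decoding table: for each pair of workers that responded, the
# weights that combine their two messages into g1 + g2 + g3.
DECODE = {
    (0, 1): (2, -1),
    (0, 2): (1, 1),
    (1, 2): (1, 2),
}


def encode(worker, partial_grads):
    """Return the coded message a worker would send (one vector)."""
    dim = len(partial_grads[0])
    return [
        sum(B[worker][k] * partial_grads[k][j] for k in range(3))
        for j in range(dim)
    ]


def decode(survivors, messages):
    """Recover the full gradient sum from the two messages that arrived."""
    coeffs = DECODE[tuple(survivors)]
    dim = len(messages[0])
    return [sum(c * m[j] for c, m in zip(coeffs, messages)) for j in range(dim)]


# Toy partial gradients, one per data partition (made-up numbers).
partial_grads = [[1, 2], [3, 4], [5, 6]]
full_gradient = [9, 12]  # element-wise sum of the three partial gradients

# Try every way one worker can straggle and decode from the other two.
recovered = {}
for survivors in combinations(range(3), 2):
    messages = [encode(w, partial_grads) for w in survivors]
    recovered[survivors] = decode(survivors, messages)

assert all(value == full_gradient for value in recovered.values())
```

The price of this tolerance is that every worker computes gradients on two partitions instead of one, while still sending a single vector. Exact `Fraction` arithmetic is used here only so the equality check is not affected by floating-point rounding.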


Taxonomy

Topics: Advanced Data Compression Techniques