Approximate Gradient Coding for Heterogeneous Nodes
Amogh Johri, Arti Yardi, and Tejas Bodas

TL;DR
This paper introduces a heterogeneous straggler model for distributed machine learning and proposes a data shuffling method that improves gradient coding efficiency in the presence of slow nodes. The approach is validated through simulations and cloud experiments.
Contribution
It models heterogeneous straggler behavior and enhances gradient coding with data shuffling, improving performance in distributed training with slow nodes.
Findings
Data shuffling significantly improves gradient coding performance.
Heterogeneous straggler model better captures real-world node behavior.
Theoretical analysis supports the effectiveness of the proposed approach.
Abstract
In distributed machine learning (DML), the training data is distributed across multiple worker nodes to perform the underlying training in parallel. One major problem affecting the performance of DML algorithms is the presence of stragglers. These are nodes that are extremely slow in performing their task, which results in under-utilization of the training data stored on them. To address this, gradient coding mitigates the impact of stragglers by adding sufficient redundancy in the data. Gradient coding and other straggler mitigation schemes assume that the straggler behavior of the worker nodes is identical. Our experiments on the Amazon AWS cluster, however, suggest otherwise: we see that there is a correlation in the straggler behavior across iterations. To model this, we introduce a heterogeneous straggler model where nodes are categorized into two classes, slow and active. To…
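The two-class idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's model definition: it assumes each node straggles independently in every iteration with a fixed probability determined by its class, and the names `simulate_stragglers`, `straggle_rates`, `p_slow`, and `p_active`, as well as the probability values, are hypothetical choices made for illustration.

```python
import random


def simulate_stragglers(num_nodes, num_slow, p_slow, p_active, num_iters, seed=0):
    """Return a list of iterations; each is a list of booleans (True = node straggled).

    Assumption: the first `num_slow` nodes belong to the 'slow' class and the
    rest to the 'active' class, and each node straggles independently per
    iteration with its class probability. These choices are illustrative only.
    """
    rng = random.Random(seed)
    probs = [p_slow] * num_slow + [p_active] * (num_nodes - num_slow)
    return [[rng.random() < p for p in probs] for _ in range(num_iters)]


def straggle_rates(history):
    """Fraction of iterations in which each node straggled."""
    num_iters = len(history)
    num_nodes = len(history[0])
    return [sum(step[n] for step in history) / num_iters for n in range(num_nodes)]


# Hypothetical parameters: 6 workers, 2 of them in the slow class.
history = simulate_stragglers(
    num_nodes=6, num_slow=2, p_slow=0.8, p_active=0.1, num_iters=2000
)
rates = straggle_rates(history)
for node, rate in enumerate(rates):
    label = "slow" if node < 2 else "active"
    print(f"node {node} ({label}): straggled in {rate:.2f} of iterations")
```

Because a node's class is fixed across iterations in this sketch, the same nodes straggle repeatedly, which is one simple way to produce the iteration-to-iteration persistence the abstract reports. A homogeneous model would instead use a single probability for every node.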
