Straggler Mitigation with Tiered Gradient Codes

Shanuja Sasi, V. Lalitha, Vaneet Aggarwal, and B. Sundar Rajan

arXiv:1909.02516 · cs.IT · May 20, 2020


TL;DR

This paper introduces tiered gradient codes for distributed gradient descent: tasks are launched in stages rather than all at once, which reduces the task size on each server and mitigates stragglers.

Contribution

It proposes a tiered system model that starts with $n_{1} \geq k$ tasks and launches $n_{2} - n_{1}$ more once $c$ tasks have completed, yielding smaller task sizes and lower job completion time than starting all $n_{2}$ tasks at request time.

Findings

01 Lower task sizes achieved with the tiered approach

02 Reduced job completion time due to the staggered task start

03 Improved server utilization efficiency

Abstract

Coding-theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job is complete when any $k$ out of $n$ servers finish their assigned tasks. The task size on each server is determined by the values of $k$ and $n$. However, it is assumed that all $n$ tasks are started when the job is requested. In contrast, we assume a tiered system, where we start with $n_{1} \geq k$ tasks and, on completion of $c$ tasks, start $n_{2} - n_{1}$ more tasks. The aim is that as long as $k$ servers can execute their tasks, the job gets completed. This paper exploits the flexibility that not all servers are started at the request time to obtain the achievable task sizes on each server. The task sizes are in general lower than those obtained by starting all $n_{2}$ tasks at the request time, thus helping achieve lower…
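The two-tier launch rule described in the abstract can be illustrated with a small timing sketch. This is not the paper's code construction: it models only when tasks start and finish, and it assumes every task has a fixed, known run time. The function name `tiered_completion_time` and the `durations` input are hypothetical, introduced here for illustration. The task-size reduction that the paper derives is not modeled.

```python
def tiered_completion_time(durations, n1, c, k):
    """Return the time at which k tasks have finished under a two-tier start.

    Assumptions (illustrative only, not taken from the paper):
      - durations[i] is the run time of task i once it has been started;
      - tasks 0..n1-1 start at time 0 (request time);
      - the remaining n2 - n1 tasks start when c first-tier tasks are done.
    """
    n2 = len(durations)
    if not (1 <= k <= n1 <= n2 and 1 <= c <= n1):
        raise ValueError("need 1 <= k <= n1 <= n2 and 1 <= c <= n1")

    # Finish times of first-tier tasks, which all start at time 0.
    first_tier = sorted(durations[:n1])
    # Second tier is launched when the c-th first-tier task completes.
    launch_time = first_tier[c - 1]
    second_tier = [launch_time + d for d in durations[n1:]]

    # The job is complete once any k tasks have finished.
    finish_times = sorted(first_tier + second_tier)
    return finish_times[k - 1]


# Example: two of the four first-tier tasks are stragglers (run time 10).
# The second tier starts at time 2 and finishes at time 3.
example = tiered_completion_time([1, 2, 10, 10, 1, 1], n1=4, c=2, k=4)
print(example)  # → 3
```

In this example, waiting on the four first-tier tasks alone would take 10 time units, whereas the two late-started tasks let the job finish at time 3. The check `k <= n1` mirrors the abstract's condition $n_{1} \geq k$.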
