Straggler Mitigation with Tiered Gradient Codes
Shanuja Sasi, V. Lalitha, Vaneet Aggarwal, and B. Sundar Rajan

TL;DR
This paper introduces tiered gradient coding techniques for distributed gradient descent, allowing staggered task initiation to reduce task sizes, improve efficiency, and mitigate stragglers in server systems.
Contribution
It proposes a tiered system model that starts only some of the tasks when the job is requested and launches more as earlier tasks complete, which lowers per-server task sizes and reduces job completion time compared with starting all tasks at request time.
Findings
Lower task sizes achieved with tiered approach
Reduced job completion time due to staggered task start
Improved server utilization efficiency
Abstract
Coding-theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job is complete when any k out of n servers finish their assigned tasks. The task size on each server is determined by the values of k and n. However, it is assumed that all the tasks are started when the job is requested. In contrast, we assume a tiered system, where we start with an initial set of tasks and, on completion of some of those tasks, start more tasks. The aim is that as long as k servers can execute their tasks, the job gets completed. This paper exploits the flexibility that not all servers are started at the request time to obtain the achievable task sizes on each server. The task sizes are in general lower than when all tasks are started at the request time, thus helping achieve lower…
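The completion rule described in the abstract can be illustrated with a minimal sketch. It is not the paper's scheme: it does not compute coded task sizes, and the function names, the parameters `k` (tasks needed to finish the job) and `c` (first-tier completions that trigger the second tier), and the example durations are all hypothetical. It only shows how the job completion time follows from per-task durations when a second tier of tasks is started late.

```python
def kth_finish_time(finish_times, k):
    """Return the time at which the k-th fastest task finishes."""
    return sorted(finish_times)[k - 1]


def tiered_completion_time(tier1_durations, tier2_durations, c, k):
    """Job completion time under an assumed two-tier start rule.

    Assumptions (hypothetical, for illustration only):
    - all tier-1 tasks start at time 0;
    - all tier-2 tasks start when the c-th tier-1 task completes;
    - the job is complete once any k tasks overall have finished.
    """
    # Tier-1 tasks start at time 0, so duration equals finish time.
    tier1_finish = sorted(tier1_durations)
    # The c-th tier-1 completion triggers the second tier.
    trigger = tier1_finish[c - 1]
    tier2_finish = [trigger + d for d in tier2_durations]
    return kth_finish_time(tier1_finish + tier2_finish, k)


# One tier-1 server is a straggler (duration 10.0).
tier1 = [1.0, 2.0, 10.0]
tier2 = [1.5]

# Without a second tier, waiting for 3 tasks means waiting for the straggler.
single_tier = kth_finish_time(tier1, 3)
# With a second tier started after the first completion, the late task covers for it.
tiered = tiered_completion_time(tier1, tier2, c=1, k=3)
```

In this toy setting the second-tier task finishes before the straggler, so the job no longer waits for the slowest first-tier server. The actual benefit claimed in the paper concerns achievable task sizes, which this sketch does not model.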
