Optimal Server Selection for Straggler Mitigation
Ajay Badita, Parimal Parag, Vaneet Aggarwal

TL;DR
This paper analyzes how to mitigate stragglers in distributed compute systems by choosing the forking instants and the number of servers added at each fork, trading off job completion time against server utilization cost.
Contribution
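It introduces a multi-fork scheduling approach and computes the job completion time and server utilization cost as functions of the forking instants and the number of servers added at each fork, assuming shifted exponential task processing times.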
Findings
Orders-of-magnitude cost reduction over prior work in the low-completion-time regime
Optimal forking strategies improve job completion times
Insights into scheduling design for distributed systems
Abstract
The performance of large-scale distributed compute systems is adversely impacted by stragglers when the execution time of a job is uncertain. To manage stragglers, we consider a multi-fork approach for job scheduling, where additional parallel servers are added at forking instants. We compute the job completion time and the cost of server utilization as functions of the forking instants and the number of additional servers, assuming that the task processing times have a shifted exponential distribution. We use this study to provide insights into the scheduling design of the forking instants and the associated number of additional servers to be started. Numerical results demonstrate an orders-of-magnitude improvement in cost in the regime of low completion times compared to prior work.
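To make the trade-off in the abstract concrete, the following is a minimal Monte Carlo sketch of a multi-fork schedule. It is not the paper's analysis. It assumes a simplified model that the summary does not spell out: a single task is replicated, a server started at forking instant t finishes at t + shift + Exp(rate), the job completes when the first server finishes, all other servers are cancelled at that moment, and cost is the total time servers spend running. The function names and parameters (`simulate_multi_fork`, `estimate`, `shift`, `rate`) are hypothetical.

```python
import random


def simulate_multi_fork(fork_times, servers_added, shift, rate, rng):
    """Draw one (completion_time, server_cost) sample under the assumed model.

    fork_times[i] is the instant at which servers_added[i] servers start.
    Each server's processing time is shift + Exp(rate) (shifted exponential).
    """
    finish_times = []
    for t, n in zip(fork_times, servers_added):
        for _ in range(n):
            finish_times.append(t + shift + rng.expovariate(rate))

    # Assumption: the job is done as soon as any replica finishes.
    completion = min(finish_times)

    # Assumption: cost is server-time used. Servers scheduled to fork
    # after the job has completed are never started and cost nothing.
    cost = sum(
        n * max(0.0, completion - t)
        for t, n in zip(fork_times, servers_added)
    )
    return completion, cost


def estimate(fork_times, servers_added, shift=1.0, rate=1.0, runs=20000, seed=0):
    """Estimate mean completion time and mean server cost by simulation."""
    rng = random.Random(seed)
    total_time = 0.0
    total_cost = 0.0
    for _ in range(runs):
        t, c = simulate_multi_fork(fork_times, servers_added, shift, rate, rng)
        total_time += t
        total_cost += c
    return total_time / runs, total_cost / runs


if __name__ == "__main__":
    # Compare a no-fork baseline with a two-stage schedule (illustrative values).
    for times, counts in [([0.0], [1]), ([0.0, 0.5], [1, 3])]:
        mean_time, mean_cost = estimate(times, counts)
        print(times, counts, round(mean_time, 3), round(mean_cost, 3))
```

Under these assumptions, sweeping the forking instants and server counts and plotting mean cost against mean completion time gives an empirical view of the design space that the paper studies analytically.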
