Straggler Mitigation by Delayed Relaunch of Tasks
Mehmet Fatih Aktas, Pei Peng, Emina Soljanin

TL;DR
This paper analyzes the cost vs. latency tradeoff of redundancy and delayed relaunch for mitigating stragglers in distributed computing, showing that coded redundancy and timely relaunch can reduce both cost and latency.
Contribution
It provides a cost versus latency analysis of redundancy methods, highlighting the benefits of coded redundancy and delayed relaunch in reducing system costs and delays.
Findings
Coded redundancy outperforms simple replication in cost-latency tradeoff.
Delayed relaunch of stragglers significantly reduces cost and latency.
Tail heaviness of task execution times influences redundancy effectiveness.
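The findings above can be explored with a small Monte Carlo sketch. It is not the paper's model or code. It assumes Pareto-distributed task execution times (shape alpha, where a smaller alpha means a heavier tail), measures cost as total server time with outstanding copies cancelled once they are no longer needed, and treats relaunch as a fresh restart at a fixed delay. All function names and parameters here (pareto, replicated_job, coded_job, relaunch_job, average, copies, delay) are hypothetical and chosen only for illustration.

```python
import random


def pareto(rng, alpha, xmin=1.0):
    # Inverse-CDF sample of an assumed Pareto(xmin, alpha) execution time.
    # Smaller alpha gives a heavier tail.
    return xmin / (1.0 - rng.random()) ** (1.0 / alpha)


def replicated_job(rng, k, copies, alpha):
    # Each of the k tasks runs `copies` copies. A task ends at its fastest
    # copy, and its slower copies are cancelled at that moment (assumption).
    firsts = [min(pareto(rng, alpha) for _ in range(copies)) for _ in range(k)]
    latency = max(firsts)
    cost = copies * sum(firsts)
    return latency, cost


def coded_job(rng, k, n, alpha):
    # n coded tasks are launched and the job ends when any k of them finish.
    # The n - k outstanding tasks are cancelled at that moment (assumption).
    times = sorted(pareto(rng, alpha) for _ in range(n))
    latency = times[k - 1]
    cost = sum(min(t, latency) for t in times)
    return latency, cost


def relaunch_job(rng, k, delay, alpha):
    # k tasks with no redundancy. A task still running at time `delay` is
    # cancelled and restarted with a fresh execution time (assumption).
    done = []
    for _ in range(k):
        t = pareto(rng, alpha)
        if t > delay:
            t = delay + pareto(rng, alpha)
        done.append(t)
    return max(done), sum(done)


def average(job, trials=2000, seed=0, **kwargs):
    # Mean latency and mean cost of `job` over independent trials.
    rng = random.Random(seed)
    results = [job(rng, **kwargs) for _ in range(trials)]
    mean_latency = sum(r[0] for r in results) / trials
    mean_cost = sum(r[1] for r in results) / trials
    return mean_latency, mean_cost


if __name__ == "__main__":
    for alpha in (1.5, 3.0):
        print("alpha", alpha)
        print("  replication x2:", average(replicated_job, k=10, copies=2, alpha=alpha))
        print("  coded (20, 10):", average(coded_job, k=10, n=20, alpha=alpha))
        print("  relaunch at 2: ", average(relaunch_job, k=10, delay=2.0, alpha=alpha))
```

Both redundancy settings in the usage loop launch 20 tasks for a 10-task job, so their mean latency and cost can be compared at equal redundancy, and varying alpha shows how the comparison shifts with tail heaviness. The numbers depend entirely on the assumptions stated above and should not be read as the paper's results.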
Abstract
Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention, and numerous papers have studied the pain and gain of redundancy under various service models and assumptions on the straggler characteristics. We here present a cost (pain) vs. latency (gain) analysis of using simple replication or erasure coding for straggler mitigation in executing jobs with many tasks. We quantify the effect of the tail of task execution times and discuss tail heaviness as a decisive parameter for the cost and latency of using redundancy. Specifically, we find that coded redundancy achieves a better cost vs. latency tradeoff than simple replication and can yield a reduction in both cost and latency under less heavy-tailed execution…
