Straggler Mitigation at Scale
Mehmet Fatih Aktas, Emina Soljanin

TL;DR
This paper analyzes the tradeoffs of using redundancy and relaunch strategies to mitigate stragglers in distributed systems, providing quantitative insights into cost and latency impacts based on empirical data.
Contribution
It offers a comprehensive cost versus latency analysis of redundancy and relaunch techniques, including novel expressions and strategies for optimizing distributed job execution.
Findings
The effectiveness of redundancy depends on how heavy the tail of the service time distribution is.
Launching redundant tasks only after an initial waiting period can reduce cost.
Combining redundancy with relaunching improves performance.
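The first two findings can be illustrated with a small Monte Carlo sketch. It is not the paper's model or code. It assumes a job of `k` tasks, `c` replicas per task launched after an optional `delay`, cancellation of outstanding copies as soon as one copy finishes, and cost measured as total server time consumed. The names `simulate`, `exp_sampler`, and `pareto_sampler`, and the Pareto shape 1.2, are hypothetical choices made for illustration.

```python
import random


def exp_sampler(rng):
    """Light-tailed service time: exponential with rate 1 (assumed)."""
    return rng.expovariate(1.0)


def pareto_sampler(rng, alpha=1.2):
    """Heavy-tailed service time: Pareto with minimum 1 and shape alpha (assumed)."""
    return rng.paretovariate(alpha)


def simulate(k, c, sampler, delay=0.0, trials=2000, seed=0):
    """Estimate (mean latency, mean cost) of a k-task job with replication.

    Assumptions of this sketch, not taken from the paper:
    - each task starts one original copy at time 0;
    - if the original is still running at time `delay`, c replicas start;
    - when any copy of a task finishes, its other copies are cancelled;
    - cost is the total time that copies spend occupying servers.
    """
    rng = random.Random(seed)
    total_latency = 0.0
    total_cost = 0.0
    for _ in range(trials):
        job_latency = 0.0
        job_cost = 0.0
        for _ in range(k):
            original = sampler(rng)
            t = original  # completion time of this task
            if c > 0 and original > delay:
                fastest_replica = delay + min(sampler(rng) for _ in range(c))
                t = min(original, fastest_replica)
            # The job finishes when its slowest task finishes.
            job_latency = max(job_latency, t)
            # Original runs for t; each replica runs from `delay` until t.
            job_cost += t + c * max(0.0, t - delay)
        total_latency += job_latency
        total_cost += job_cost
    return total_latency / trials, total_cost / trials


if __name__ == "__main__":
    for name, sampler in [("exponential", exp_sampler), ("pareto", pareto_sampler)]:
        for c, delay in [(0, 0.0), (1, 0.0), (1, 2.0)]:
            latency, cost = simulate(10, c, sampler, delay=delay)
            print(f"{name:12s} c={c} delay={delay}: latency={latency:.2f} cost={cost:.2f}")
```

Under these assumptions, exponential service times give the same expected cost for any `c` when `delay` is 0, because the minimum of `c + 1` rate-1 exponentials has mean `1 / (c + 1)`. With the heavy-tailed Pareto sampler, adding a replica shortens both latency and cost in this sketch. Varying `delay` shows how waiting before replicating trades latency against server time.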
Abstract
Runtime performance variability at the servers has been a major issue, hindering predictable and scalable performance in modern distributed systems. Executing requests or jobs redundantly over multiple servers has been shown to be effective for mitigating variability, both in theory and in practice. Systems that employ redundancy have drawn significant attention, and numerous papers have analyzed the pain and gain of redundancy under various service models and assumptions on the runtime variability. This paper presents a cost (pain) vs. latency (gain) analysis of executing jobs of many tasks by employing replicated or erasure coded redundancy. Tail heaviness of service time variability is decisive for the pain and gain of redundancy, and we quantify its effect by deriving expressions for the cost and latency. Specifically, we try to answer four questions: 1) How do replicated and coded…
