Data Replication for Reducing Computing Time in Distributed Systems with Stragglers
Amir Behrouzi-Far, Emina Soljanin

TL;DR
This paper investigates optimal data replication strategies in distributed systems with stragglers. It shows that balanced replication of disjoint data batches is the optimal assignment policy when job execution times are stochastically decreasing and convex, and it analyzes the trade-off between the expected value and the variance of the job completion time for Exponential and Shifted-Exponential service times.
Contribution
The paper characterizes the optimal data replication policy in systems whose job execution time is a stochastically decreasing and convex random variable, and derives the optimal redundancy levels for Exponential and Shifted-Exponential service times.
Findings
Balanced replication of disjoint data batches minimizes the average delay.
The optimal redundancy level can differ between the mean and the variance of the job completion time.
A trade-off exists between reducing the mean and reducing the variance of the completion time.
Abstract
In distributed computing systems with stragglers, various forms of redundancy can improve the average delay performance. We study the optimal replication of data in systems where the job execution time is a stochastically decreasing and convex random variable. We show that in such systems, the optimum assignment policy is the balanced replication of disjoint batches of data. Furthermore, for Exponential and Shifted-Exponential service times, we derive the optimum redundancy levels for minimizing both the expected value and the variance of the job completion time. Our analysis shows that the optimum redundancy level may not be the same for the two metrics; thus, there is a trade-off between reducing the expected value of the completion time and reducing its variance.
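The trade-off described in the abstract can be illustrated with a minimal Python sketch. It rests on an assumed toy model, not on the paper's own formulas: N workers, B disjoint batches with B dividing N, each batch holding N/B units of data and replicated on N/B workers, and a worker finishing a batch of size s in time s(delta + X) with X ~ Exp(mu). Under this assumption the fastest of the N/B replicas finishes in (N/B)*delta plus an Exp(mu) time, and the job completes when the slowest of the B batches does, so the mean and variance follow from the standard moments of the maximum of B i.i.d. exponentials. The names `balanced_assignment`, `completion_time_stats`, `sweep`, `mu`, and `delta` are hypothetical, chosen for illustration.

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [b for b in range(1, n + 1) if n % b == 0]


def balanced_assignment(n_workers, n_batches):
    """Map each worker to one batch so every batch gets the same number of replicas."""
    if n_workers % n_batches != 0:
        raise ValueError("n_batches must divide n_workers for a balanced assignment")
    replicas = n_workers // n_batches
    return [worker // replicas for worker in range(n_workers)]


def completion_time_stats(n_workers, n_batches, mu=1.0, delta=0.0):
    """Mean and variance of the job completion time under the ASSUMED toy model.

    Assumption (not taken from the paper): a worker needs s * (delta + X),
    X ~ Exp(mu), for a batch of s = n_workers / n_batches data units.
    The minimum over s replicas is then s * delta + Exp(mu), and the job
    time is the maximum of n_batches such i.i.d. variables.
    """
    if n_workers % n_batches != 0:
        raise ValueError("n_batches must divide n_workers")
    batch_size = n_workers // n_batches
    # Max of B i.i.d. Exp(mu): mean = sum 1/(k*mu), variance = sum 1/(k*mu)^2.
    mean = batch_size * delta + sum(1.0 / (k * mu) for k in range(1, n_batches + 1))
    var = sum(1.0 / (k * mu) ** 2 for k in range(1, n_batches + 1))
    return mean, var


def sweep(n_workers, mu=1.0, delta=0.0):
    """Completion-time (mean, variance) for every balanced choice of batch count."""
    return {b: completion_time_stats(n_workers, b, mu, delta) for b in divisors(n_workers)}


if __name__ == "__main__":
    stats = sweep(12, mu=1.0, delta=0.1)
    for b, (mean, var) in stats.items():
        print(f"B={b:2d}  replicas={12 // b:2d}  mean={mean:.4f}  var={var:.4f}")
    print("best B for mean:", min(stats, key=lambda b: stats[b][0]))
    print("best B for variance:", min(stats, key=lambda b: stats[b][1]))
```

With the illustrative values N = 12, mu = 1, and delta = 0.1, this toy model puts the smallest mean at B = 2 and the smallest variance at B = 1, so the two metrics favor different redundancy levels. The specific numbers depend entirely on the assumed service-time scaling and should not be read as results from the paper.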
