Diversity/Parallelism Trade-off in Distributed Systems with Redundancy
Pei Peng, Emina Soljanin, Philip Whiting

TL;DR
This paper analyzes the trade-off between redundancy and parallelism in distributed systems, identifying optimal strategies to minimize job completion time under resource constraints across various service time distributions.
Contribution
It characterizes the diversity versus parallelism trade-off and determines optimal redundancy strategies for different service time models in distributed computing.
Findings
Different service time distributions call for different redundancy levels to minimize job completion time.
Depending on the service time distribution, the optimal strategy is replication, coding, or splitting.
The study provides guidelines for choosing redundancy methods to minimize completion time.
Abstract
As numerous machine learning and other algorithms increase in complexity and data requirements, distributed computing becomes necessary to satisfy the growing computational and storage demands, because it enables parallel execution of smaller tasks that make up a large computing job. However, random fluctuations in task service times lead to straggling tasks with long execution times. Redundancy, in the form of task replication and erasure coding, provides diversity that allows a job to be completed when only a subset of redundant tasks is executed, thus removing the dependency on the straggling tasks. In situations of constrained resources (here a fixed number of parallel servers), increasing redundancy reduces the available resources for parallelism. In this paper, we characterize the diversity vs. parallelism trade-off and identify the optimal strategy, among replication, coding and…
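The trade-off described in the abstract can be illustrated with a small Monte Carlo sketch. It is not the paper's model or method; it rests on the following assumptions, all introduced here for illustration: a unit-size job on n servers is split into k tasks of size 1/k, which are encoded into n tasks (one per server) so that the job finishes as soon as any k of the n tasks complete. Under that assumption, k = 1 corresponds to replication, k = n to splitting with no redundancy, and values in between to coding. The two service time models (`scaled_exponential` and `make_shifted_exponential`) and the function names are hypothetical choices, not taken from the paper.

```python
import random


def expected_completion_time(n, k, sample_task_time, trials=20000, seed=0):
    """Estimate the mean job completion time by simulation.

    Assumed setup (illustrative only): each of the n servers runs one task
    of size 1/k, and the job completes when the k-th fastest task finishes.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        times = sorted(sample_task_time(rng, 1.0 / k) for _ in range(n))
        total += times[k - 1]  # k-th smallest of the n task times
    return total / trials


def scaled_exponential(rng, size):
    # Assumption: task time is exponential with mean equal to the task size.
    return rng.expovariate(1.0 / size)


def make_shifted_exponential(shift):
    # Assumption: a deterministic part proportional to the task size plus an
    # exponential delay with mean 1 that does not depend on the task size.
    def sample(rng, size):
        return shift * size + rng.expovariate(1.0)
    return sample


def best_k(n, sample_task_time, trials=20000):
    """Return the k in 1..n with the smallest estimated completion time."""
    estimates = {
        k: expected_completion_time(n, k, sample_task_time, trials)
        for k in range(1, n + 1)
    }
    return min(estimates, key=estimates.get), estimates


if __name__ == "__main__":
    n = 12
    models = [
        ("scaled exponential", scaled_exponential),
        ("shifted exponential", make_shifted_exponential(10.0)),
    ]
    for name, model in models:
        k_opt, _ = best_k(n, model)
        print(f"{name}: best k = {k_opt} of n = {n}")
```

For the scaled exponential model the estimate can be checked against the known mean of the k-th order statistic of n i.i.d. exponentials, (1/k)(H_n - H_{n-k}), where H_m is the m-th harmonic number. In this sketch the two models favor different values of k, which mirrors the qualitative point that the best redundancy level depends on the service time distribution.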
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Distributed Systems and Fault Tolerance
