Stability and Optimization of Speculative Queueing Networks
Jonatha Anselmi, Neil Walton

TL;DR
This paper develops a queueing-theoretic framework for speculative job execution, analyzing its stability, optimization, and performance benefits over traditional load balancing and replication schemes in distributed systems.
Contribution
It introduces a stability analysis for speculative queueing networks, providing conditions for stability, optimal timeout formulas, and comparisons with existing redundancy methods.
Findings
Speculation can expand the stability region of load balancing networks.
An optimally chosen timeout minimizes load and improves system stability.
Under heavy load, speculation outperforms redundancy schemes in response times.
Abstract
We provide a queueing-theoretic framework for job replication schemes based on the principle "replicate a job as soon as the system detects it as a straggler". This is called job speculation. Recent works have analyzed replication on arrival, which we refer to as replication. Replication is motivated by its implementation in Google's BigTable. However, systems such as Apache Spark and Hadoop MapReduce implement speculative job execution. The performance and optimization of speculative job execution are not well understood. To this end, we propose a queueing network model for load balancing where each server can speculate on the execution time of a job. Specifically, each job is initially assigned to a single server by a frontend dispatcher. Then, when its execution begins, the server sets a timeout. If the job completes before the timeout, it leaves the…
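To make the timeout mechanism concrete, here is a minimal Monte Carlo sketch of how much server time one job consumes for a given timeout. It is not the paper's model: the abstract is truncated before it says what happens once the timeout expires, so the rule used here is an assumption. Specifically, the sketch assumes that the original copy keeps running, that one replica starts immediately on another server with an independent service time, and that the job ends (and the other copy is cancelled) as soon as either copy finishes. The names `expected_work` and `bimodal`, and the toy service-time distribution, are hypothetical.

```python
import random


def expected_work(timeout, sample_service, n=100_000, seed=0):
    """Estimate the mean total server time consumed per job.

    ASSUMED rule (not taken from the paper): if the original copy is still
    running at `timeout`, one replica starts at once on another server with
    an independent service time; the job ends when either copy finishes and
    the remaining copy is cancelled.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x1 = sample_service(rng)  # service time of the original copy
        if x1 <= timeout:
            # Completed before the timeout: no replica is ever launched.
            total += x1
        else:
            x2 = sample_service(rng)  # service time of the replica
            finish = min(x1, timeout + x2)  # first copy to complete
            # Original runs for `finish`; replica runs for `finish - timeout`.
            total += finish + (finish - timeout)
    return total / n


def bimodal(rng):
    # Hypothetical service time: usually 1 unit, a straggler (20 units)
    # with probability 0.1.
    return 20.0 if rng.random() < 0.1 else 1.0


if __name__ == "__main__":
    for tau in (0.0, 1.0, float("inf")):
        print(f"timeout={tau}: mean work per job ~ {expected_work(tau, bimodal):.2f}")
```

Under these assumptions, a timeout of 0 corresponds to replicating every job on arrival and an infinite timeout to never speculating. For this toy distribution, a timeout just long enough to let normal jobs finish uses less server time per job than either extreme, which is the intuition behind tuning the timeout to minimize load.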
