Load is not what you should balance: Introducing Prequal

Bartek Wydrowski; Robert Kleinberg; Stephen M. Rumble; Aaron Archer

arXiv:2312.10172 · cs.DC · December 19, 2023 · 2 cites

TL;DR

Prequal is a load balancer that reduces request latency in distributed systems. Instead of balancing CPU load, it probes servers and selects among them by estimated latency and active requests-in-flight (RIF), which the authors report lowers tail latency and improves resource utilization.

Contribution

Prequal introduces a new load balancing approach based on latency and active requests, extending the power-of-d-choices paradigm with asynchronous probing for better latency management.
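For context, the classic power-of-d-choices rule that Prequal extends can be sketched in a few lines. This is a minimal illustration of the baseline paradigm only, not Prequal itself; the function name `pick_power_of_d` and the use of a per-server load list are assumptions made for the example.

```python
import random


def pick_power_of_d(loads, d, rng=random):
    """Baseline power-of-d-choices: sample d distinct servers uniformly
    at random and return the index of the least loaded one.

    `loads` is a list of per-server load values (for example, requests
    in flight). The name and signature are hypothetical.
    """
    candidates = rng.sample(range(len(loads)), d)
    return min(candidates, key=lambda i: loads[i])


if __name__ == "__main__":
    requests_in_flight = [7, 2, 9, 4, 1, 6]
    # With d=2, only two randomly chosen servers are compared per request.
    chosen = pick_power_of_d(requests_in_flight, d=2, rng=random.Random(0))
    print("chosen server index:", chosen)
```

Comparing only d sampled servers keeps the per-request cost constant regardless of fleet size, which is what makes the paradigm attractive as a starting point.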

Findings

01. Significantly reduces tail latency and error rates.

02. Decreases resource usage and improves system utilization.

03. Successfully deployed at YouTube for over two years.

Abstract

We present Prequal (Probing to Reduce Queuing and Latency), a load balancer for distributed multi-tenant systems. Prequal aims to minimize real-time request latency in the presence of heterogeneous server capacities and non-uniform, time-varying antagonist load. It actively probes server load to leverage the power-of-d-choices paradigm, extending it with asynchronous and reusable probes. Cutting against received wisdom, Prequal does not balance CPU load, but instead selects servers according to estimated latency and active requests-in-flight (RIF). We explore its major design features on a testbed system and evaluate it on YouTube, where it has been deployed for more than two years. Prequal has dramatically decreased tail latency, error rates, and resource use, enabling YouTube and other production systems at Google to run at much higher utilization.
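The abstract says servers are selected by estimated latency and RIF using reusable probes, but it does not spell out the rule. The sketch below shows one plausible way to combine the two signals, assuming a RIF threshold that separates busy servers from the rest and a fixed reuse budget per probe. The `Probe` type, `select_server`, `rif_threshold`, and `uses_left` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """A cached probe response (hypothetical structure)."""
    server: str
    rif: int             # requests in flight reported by the server
    latency_ms: float    # latency estimate reported by the server
    uses_left: int = 2   # assumed reuse budget before the probe is dropped


def select_server(pool, rif_threshold):
    """Pick a server from a pool of previously collected probes.

    Assumed rule: among probes whose RIF is below the threshold, take
    the lowest estimated latency; if every probe is at or above the
    threshold, fall back to the lowest RIF.
    """
    if not pool:
        raise ValueError("probe pool is empty")
    below = [p for p in pool if p.rif < rif_threshold]
    if below:
        best = min(below, key=lambda p: p.latency_ms)
    else:
        best = min(pool, key=lambda p: p.rif)
    best.rif += 1          # account for the request about to be sent
    best.uses_left -= 1    # a probe may be reused a limited number of times
    if best.uses_left <= 0:
        pool.remove(best)
    return best.server


if __name__ == "__main__":
    pool = [Probe("a", 3, 40.0), Probe("b", 12, 5.0), Probe("c", 5, 25.0)]
    print(select_server(pool, rif_threshold=10))
```

Because probes are collected ahead of time and reused, the selection step itself needs no network round trip; in a real system the pool would be refilled in the background and stale probes aged out.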

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Distributed systems and fault tolerance