Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale
Tianci Bu, Yuan Lyu, Zixi Chen, Chendong Song, Hong Liang, Tsepten Gurung, Yuwei Fan, Yinyu Ye, Zijie Zhou

TL;DR
This paper introduces BalanceRoute, a family of online routing algorithms designed to mitigate data-parallel load balancing bottlenecks in large-scale LLM serving, significantly improving throughput and reducing imbalance.
Contribution
The paper presents practical online routing algorithms, BR-0 and BR-H, that effectively address load imbalance in LLM serving without requiring complex prediction infrastructure.
Findings
BalanceRoute reduces DP imbalance in LLM serving workloads.
BalanceRoute improves end-to-end throughput on large-scale clusters.
BalanceRoute outperforms vLLM baselines on proprietary and public traces.
Abstract
Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100 ms decode budget over hundreds of waiting requests and tens of workers. We present BalanceRoute, a family of practical online routing algorithms that target this bottleneck. The first,…
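To make the barrier effect described in the abstract concrete, the following is a minimal sketch, not the paper's algorithm. It assumes each worker's per-step time is proportional to a single scalar "load"; the function names and the example load values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def barrier_step_time(loads: Sequence[float]) -> float:
    """Latency of one decode step under a synchronization barrier.

    Assumption: each worker's step time equals its load, so the most
    heavily loaded worker sets the step time for everyone.
    """
    return max(loads)


def idle_fraction(loads: Sequence[float]) -> float:
    """Fraction of total worker-time spent waiting at the barrier in one step."""
    step = barrier_step_time(loads)
    if step == 0:
        return 0.0
    mean_load = sum(loads) / len(loads)
    return 1.0 - mean_load / step


# Hypothetical per-worker loads for four DP workers.
balanced = [10.0, 10.0, 10.0, 10.0]
skewed = [8.0, 8.0, 8.0, 16.0]

print(idle_fraction(balanced))  # 0.0
print(idle_fraction(skewed))    # 0.375
```

Both load vectors carry the same total work, but in the skewed case one straggler stretches every step from 10 to 16 units, so under this simplified model 37.5% of worker-time is spent idle at the barrier on each step.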
