Tackling the Data-Parallel Load Balancing Bottleneck in LLM Serving: Practical Online Routing at Scale

Tianci Bu; Yuan Lyu; Zixi Chen; Chendong Song; Hong Liang; Tsepten Gurung; Yuwei Fan; Yinyu Ye; Zijie Zhou

arXiv:2605.06113·cs.DC·May 11, 2026

TL;DR

This paper introduces BalanceRoute, a family of online routing algorithms designed to mitigate data-parallel load balancing bottlenecks in large-scale LLM serving, significantly improving throughput and reducing imbalance.

Contribution

The paper presents practical online routing algorithms, BR-0 and BR-H, that effectively address load imbalance in LLM serving without requiring complex prediction infrastructure.

Findings

01. BalanceRoute reduces data-parallel (DP) imbalance in LLM serving workloads.

02. BalanceRoute improves end-to-end throughput on large-scale clusters.

03. BalanceRoute outperforms vLLM baselines on proprietary and public traces.

Abstract

Data-parallel (DP) load balancing has emerged as a first-order bottleneck in large-scale LLM serving. When a model is sharded across devices via tensor parallelism (TP) or expert parallelism (EP) and replicated across many DP workers, every decode step ends in a synchronization barrier whose latency is set by the most heavily loaded worker; even modest persistent imbalance across DP workers compounds, step after step, into a substantial fraction of wasted compute. The problem is hard for reasons specific to LLM decoding: assignments are sticky (migrating KV caches has a high cost), per-request loads grow over time, arrivals are non-stationary, and the router must decide within a sub-100 ms decode budget over hundreds of waiting requests and tens of workers. We present BalanceRoute, a family of practical online routing algorithms that target this bottleneck. The first,…
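To make the barrier argument in the abstract concrete, the following is a minimal Python sketch of how imbalance turns into idle time. It assumes, for illustration only, that a worker's decode-step time is proportional to a scalar "load" (for example, its total active tokens); the function names `barrier_waste` and `route_least_loaded` are hypothetical. The greedy least-loaded router shown is a generic baseline and is not BR-0 or BR-H, whose details are not given in this excerpt.

```python
from typing import List, Sequence


def barrier_waste(loads: Sequence[float]) -> float:
    """Fraction of total worker-time spent idle at the barrier in one decode step.

    Assumption: each worker's step time is proportional to its load, and every
    worker waits for the slowest one, so the step lasts max(loads).
    """
    if not loads:
        raise ValueError("need at least one worker")
    slowest = max(loads)
    if slowest == 0:
        return 0.0  # no work anywhere, so nothing is wasted
    mean_load = sum(loads) / len(loads)
    return 1.0 - mean_load / slowest


def route_least_loaded(loads: List[float], request_load: float) -> int:
    """Generic greedy baseline: send the request to the currently lightest worker.

    Mutates `loads` in place and returns the chosen worker index. The assignment
    is sticky: nothing is ever migrated afterwards.
    """
    idx = min(range(len(loads)), key=loads.__getitem__)
    loads[idx] += request_load
    return idx


if __name__ == "__main__":
    balanced = [10.0, 10.0, 10.0, 10.0]
    skewed = [10.0, 10.0, 10.0, 20.0]
    print(barrier_waste(balanced))  # no idle time when loads are equal
    print(barrier_waste(skewed))    # one heavy worker stalls the other three
```

Under this simplified model, a single worker carrying twice the load of its three peers leaves 37.5% of the step's worker-time idle, and because the barrier recurs at every decode step, that loss repeats for as long as the imbalance persists.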
