Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale
Renzhong Yuan, Yijun Zeng, Xiaosong Gao, Linxi Yu, Haochun Liao, Han Wang

TL;DR
This paper presents a client-side scheduling framework for black-box LLM inference that improves throughput and deadline satisfaction by decomposing the problem into allocation, ordering, and overload control, with practical experiments and policy comparisons.
Contribution
It introduces a three-layer client-side decomposition for scheduling black-box LLM inference, with experimental validation and analysis of different allocation and overload policies.
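To make the overload-control layer concrete, here is a minimal sketch of an admit/defer/reject decision. The summary does not specify the paper's actual policy or its cost ladder, so everything here is an assumption for illustration: the function name `overload_decision`, the linear wait estimate, and the parameters `backlog_tokens`, `drain_rate_tps`, and `defer_limit_s` are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ADMIT = "admit"
    DEFER = "defer"
    REJECT = "reject"


def overload_decision(est_tokens, slack_s, backlog_tokens, drain_rate_tps,
                      defer_limit_s=5.0):
    """Pick admit/defer/reject for one request (hypothetical policy).

    est_tokens:      coarse prior on the request's output tokens
    slack_s:         seconds remaining until the request's deadline
    backlog_tokens:  estimated tokens already queued ahead of it
    drain_rate_tps:  assumed observed throughput in tokens per second
    defer_limit_s:   how far past the deadline we still consider deferring
    """
    # Naive estimate: everything ahead drains first, then this request.
    est_finish_s = (backlog_tokens + est_tokens) / drain_rate_tps
    if est_finish_s <= slack_s:
        return Decision.ADMIT
    # A near miss is deferred in the hope that the backlog shrinks.
    if est_finish_s <= slack_s + defer_limit_s:
        return Decision.DEFER
    return Decision.REJECT
```

The point of the sketch is only that the decision is explicit and conditioned on a token prior; a real policy would presumably also weigh the cost of each outcome per request class.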
Findings
Coarse magnitude priors are crucial for effective client control.
The full stack achieves 100% completion and 100% deadline satisfaction under balanced / high congestion.
Fair Queuing improves short-request tail latency compared to FIFO.
Abstract
When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information-ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 latency and degrades deadline satisfaction. Under balanced / high congestion the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of …
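The allocation layer can be illustrated with plain deficit round robin (DRR) over request classes, charging each request its predicted token count. This is a sketch of textbook DRR with fixed quanta, not the paper's adaptive variant, whose adaptation rule is not described in this summary. The names `Request`, `ClassQueue`, `drr_round`, and the quantum values are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: str
    est_tokens: int  # coarse output-token prior, assumed known at submission


@dataclass
class ClassQueue:
    quantum: int  # token credit added per round (hypothetical fixed weight)
    deficit: int = 0
    pending: deque = field(default_factory=deque)


def drr_round(queues):
    """Run one DRR round over all classes; return dispatched request ids."""
    dispatched = []
    for q in queues.values():
        if not q.pending:
            q.deficit = 0  # idle classes do not bank credit
            continue
        q.deficit += q.quantum
        # Dispatch from the head while the class can afford the estimated cost.
        while q.pending and q.pending[0].est_tokens <= q.deficit:
            req = q.pending.popleft()
            q.deficit -= req.est_tokens
            dispatched.append(req.req_id)
        if not q.pending:
            q.deficit = 0
    return dispatched


def demo():
    """Three short requests and one long request, equal quanta of 200 tokens."""
    queues = {"short": ClassQueue(quantum=200), "long": ClassQueue(quantum=200)}
    for i in range(3):
        queues["short"].pending.append(Request(f"s{i + 1}", 100))
    queues["long"].pending.append(Request("l1", 500))
    return [drr_round(queues) for _ in range(3)]
```

In `demo()`, the long request accumulates credit over three rounds before it can be dispatched, while short requests go out in the first two rounds. Without a magnitude prior the client could not charge requests differently, which is one way to read the abstract's claim that class labels alone are not enough.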
