Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

Renzhong Yuan; Yijun Zeng; Xiaosong Gao; Linxi Yu; Haochun Liao; Han Wang

arXiv:2604.06970·cs.DC·April 9, 2026

TL;DR

This paper presents a client-side scheduling framework for black-box LLM inference. It improves throughput and deadline satisfaction by decomposing the problem into allocation, ordering, and overload control, and it supports the design with experiments and policy comparisons.

Contribution

It introduces a three-layer client-side decomposition for scheduling black-box LLM inference, with experimental validation and analysis of different allocation and overload policies.

Findings

01 Coarse magnitude priors are crucial for effective client control.

02 Full stack achieves 100% completion and deadline satisfaction under high congestion.

03 Fair Queuing improves short-request tail latency compared to FIFO.

Abstract

When output token counts can be predicted at submission time (Gan et al., 2026), client-side scheduling against a black-box LLM API becomes semi-clairvoyant: decisions condition on coarse token priors even though the provider's internals remain hidden. We decompose this boundary problem into three separable concerns: allocation (inter-class share via adaptive DRR), ordering (intra-class sequencing with feasible-set scoring), and overload control (explicit admit/defer/reject on a cost ladder). An information ladder experiment shows that coarse magnitude priors -- not class labels alone -- are the practical threshold for useful client control; removing magnitude inflates short-request P95 by up to $5.8\times$ and degrades deadline satisfaction. Under balanced / high congestion, the full stack achieves 100% completion, 100% deadline satisfaction, and useful goodput of $4.2 \pm 1.6$ …
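The allocation layer named in the abstract builds on deficit round robin (DRR). As a rough illustration only, the sketch below implements classic, non-adaptive DRR over per-class queues, using a predicted output token count as each request's cost. It is not the paper's adaptive variant. The function name `drr_schedule`, the class names, the quanta, and the token estimates are all hypothetical values chosen for the example.

```python
from collections import deque


def drr_schedule(queues, quanta, max_rounds=1000):
    """Classic deficit round robin (not the paper's adaptive variant).

    queues: dict mapping class name -> deque of (request_id, est_tokens),
            where est_tokens is an assumed coarse token prior.
    quanta: dict mapping class name -> token credit added per round.
    Returns the dispatch order as a list of request ids.
    """
    deficit = {cls: 0 for cls in queues}
    order = []
    for _ in range(max_rounds):
        # Stop once every class queue is drained.
        if not any(queues.values()):
            break
        for cls, q in queues.items():
            if not q:
                # An idle class must not bank credit.
                deficit[cls] = 0
                continue
            deficit[cls] += quanta[cls]
            # Dispatch head-of-line requests while the credit covers them.
            while q and q[0][1] <= deficit[cls]:
                request_id, cost = q.popleft()
                deficit[cls] -= cost
                order.append(request_id)
            if not q:
                deficit[cls] = 0
    return order


# Hypothetical workload: three short requests and two long ones.
queues = {
    "short": deque([("s1", 100), ("s2", 100), ("s3", 100)]),
    "long": deque([("l1", 500), ("l2", 500)]),
}
quanta = {"short": 200, "long": 250}
order = drr_schedule(queues, quanta)
print(order)  # ['s1', 's2', 's3', 'l1', 'l2']
```

With these made-up numbers, the long class needs two rounds of credit before its first request can go out, so the short requests are not stuck behind it. An adaptive scheme would presumably adjust the quanta at run time; how the paper does so is not stated in the text above.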
