LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling
Dingyan Zhang, Jinbo Han, Kaixi Zhang, Xingda Wei, Sijie Shen, Chenguang Fang, Wenyuan Yu, Jingren Zhou, and Rong Chen

TL;DR
This paper introduces a multiplication-based scheduling score for large language model requests that balances instance load against KV cache (KV$) reuse without hyperparameter tuning, and it outperforms existing methods.
Contribution
The paper proposes a hyperparameter-free scheduling score for LLM requests, computed as a product of two indicators, that is simpler than existing combinator-based methods and improves on them.
Findings
Reduces time to first token (TTFT) by up to 92% and 52%.
Reduces time per output token (TPOT) by approximately 20%.
Effective across diverse real-world workloads.
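For readers unfamiliar with the two latency metrics, the sketch below shows one common way to compute them from request timestamps, plus the percentage reduction between a baseline and a new measurement. The function names, the timestamp arguments, and the sample numbers are illustrative assumptions; they are not taken from the paper, whose exact measurement procedure is not given in this summary.

```python
def ttft(arrival_s: float, first_token_s: float) -> float:
    """Time to first token: delay between request arrival and the first output token."""
    return first_token_s - arrival_s


def tpot(first_token_s: float, last_token_s: float, n_output_tokens: int) -> float:
    """Time per output token: average gap between consecutive output tokens.

    Uses the common convention of excluding the first token, which is
    already accounted for by TTFT (an assumption, not the paper's definition).
    """
    if n_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_s - first_token_s) / (n_output_tokens - 1)


def reduction_pct(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Illustrative numbers only: a request arrives at t=0.0 s, emits its first
# token at t=0.5 s and its 41st (last) token at t=2.5 s.
example_ttft = ttft(0.0, 0.5)
example_tpot = tpot(0.5, 2.5, 41)
```

Under these definitions, a "92% reduction in TTFT" means the new TTFT is 8% of the baseline value.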
Abstract
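High-quality LLM request scheduling requires achieving two key objectives: reusing KV$ on the routed instance to accelerate the request, and balancing load across instances. The paper multiplies two indicators, one KV$-aware (new prefill tokens if routed to an instance) and one load-balancing-aware (current batch size of the instance), and uses the product as the scheduling score, which can…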
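The scoring idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` class, its field names, and the rule "route to the lowest score" are hypothetical. The two indicators (new prefill tokens and current batch size) are the ones named in the abstract; how the paper handles edge cases such as an idle instance with batch size 0 is not stated in this summary.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One instance as seen by the scheduler for a single incoming request (hypothetical type)."""
    name: str
    batch_size: int            # requests currently batched on the instance
    cached_prompt_tokens: int  # prompt tokens of this request already in the instance's KV$


def new_prefill_tokens(prompt_tokens: int, cand: Candidate) -> int:
    """KV$-aware indicator: tokens that still need prefill if routed to this instance."""
    return max(prompt_tokens - cand.cached_prompt_tokens, 0)


def score(prompt_tokens: int, cand: Candidate) -> int:
    """Product of the two indicators; no weights or other hyperparameters."""
    return new_prefill_tokens(prompt_tokens, cand) * cand.batch_size


def pick_instance(prompt_tokens: int, candidates: list[Candidate]) -> Candidate:
    """Assumption: a lower score is better, since both indicators are costs."""
    return min(candidates, key=lambda c: score(prompt_tokens, c))


# Illustrative numbers only: a 1000-token prompt and three instances.
candidates = [
    Candidate("A", batch_size=8, cached_prompt_tokens=900),  # 100 * 8 = 800
    Candidate("B", batch_size=2, cached_prompt_tokens=0),    # 1000 * 2 = 2000
    Candidate("C", batch_size=4, cached_prompt_tokens=600),  # 400 * 4 = 1600
]
chosen = pick_instance(1000, candidates)
```

In this example the busiest instance A is still chosen because its large cache hit outweighs its larger batch. A linear combination of the same indicators would need a tuned weight to make that trade-off, which is the tuning the product avoids.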
Taxonomy
Topics: Distributed and Parallel Computing Systems · Software System Performance and Reliability · Cloud Computing and Resource Management
