LMetric: Simple is Better - Multiplication May Be All You Need for LLM Request Scheduling

Dingyan Zhang; Jinbo Han; Kaixi Zhang; Xingda Wei; Sijie Shen; Chenguang Fang; Wenyuan Yu; Jingren Zhou; and Rong Chen

arXiv:2603.15202 · cs.DC · March 26, 2026

TL;DR

This paper introduces a simple multiplication-based scheduling score for large language model requests that effectively balances workload and KV$ acceleration without hyperparameter tuning, outperforming existing methods.

Contribution

The paper proposes a novel, hyperparameter-free multiplication approach for LLM request scheduling that simplifies and improves upon complex existing combinator-based methods.

Findings

01

Reduces TTFT by up to 92% and 52%.

02

Reduces TPOT by approximately 20%.

03

Effective across diverse real-world workloads.

Abstract

High-quality LLM request scheduling requires achieving two key objectives: whether the routed instance has KV$ to accelerate the request execution and whether the workload is balanced across instances. Achieving both objectives is challenging because pursuing one objective may compromise the other. Current approaches adopt various combinators (e.g., linear combinations) to compute a scheduling score combining indicators for the two objectives, which are complex in that they either require significant workload-specific hyperparameter tuning or model-hardware-aware simulator development, and could still lead to suboptimal performance. In this paper, we show that using a simple multiplication of two carefully chosen indicators - one for KV$-aware (new prefill tokens if routed to an instance) and one for load balancing-aware (current batch size of the instance) - as the scheduling score can…
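As a rough illustration of the multiplicative score described in the abstract, the sketch below multiplies the two named indicators (new prefill tokens if routed to an instance, and the instance's current batch size) and routes to the instance with the smallest product. Several things here are assumptions, not taken from the paper: that a lower score is preferred, the `Candidate` type and its field names, the way new prefill tokens are derived from a cache-hit length, and the tie-break on new prefill tokens when products are equal (for example, when a batch size is zero).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-request view of one serving instance."""
    name: str
    batch_size: int        # requests currently in the instance's batch
    cache_hit_tokens: int  # prompt tokens already covered by this instance's KV$


def new_prefill_tokens(prompt_tokens: int, c: Candidate) -> int:
    # Tokens that would still need prefill if the request were routed to c.
    return max(prompt_tokens - c.cache_hit_tokens, 0)


def score(prompt_tokens: int, c: Candidate) -> int:
    # Product of the KV$ indicator and the load indicator; no weights to tune.
    return new_prefill_tokens(prompt_tokens, c) * c.batch_size


def route(prompt_tokens: int, candidates: list[Candidate]) -> Candidate:
    # Assumption: lower score is better. Equal products are broken by
    # fewer new prefill tokens, which is this sketch's own choice.
    return min(
        candidates,
        key=lambda c: (score(prompt_tokens, c), new_prefill_tokens(prompt_tokens, c)),
    )


candidates = [
    Candidate("A", batch_size=8, cache_hit_tokens=900),  # 100 * 8 = 800
    Candidate("B", batch_size=2, cache_hit_tokens=0),    # 1000 * 2 = 2000
    Candidate("C", batch_size=4, cache_hit_tokens=600),  # 400 * 4 = 1600
]
print(route(1000, candidates).name)  # prints "A"
```

In this toy case the most heavily loaded instance still wins because its cache hit removes most of the prefill work, while the lightly loaded instance with no cache hit scores worst. A linear combination of the same two indicators would need a weight to put tokens and batch size on a common scale; the product avoids that.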

Taxonomy

Topics: Distributed and Parallel Computing Systems · Software System Performance and Reliability · Cloud Computing and Resource Management