Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions

Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, and Jiawei Jiang

arXiv:2604.00499 · cs.LG · April 2, 2026

TL;DR

This paper introduces a distribution-based approach to predict output lengths in LLM inference scheduling, improving latency and throughput by accounting for uncertainty in output length predictions.

Contribution

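The paper proposes a method for fitting a distribution to each request's output length, together with a metric called Tail Inflated Expectation (TIE) that uses this fitted distribution, rather than a single predicted length, to schedule LLM inference requests.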

Findings

01. TIE reduces per-token latency by 2.31× in online inference.

02. TIE improves throughput by 1.42× in offline data generation.

03. Output length follows a heavy-tailed distribution that is well modeled by the log-t distribution.
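To make the heavy-tail finding concrete, the sketch below draws samples from a log-t distribution, meaning a variable whose logarithm follows a location-scale Student's t distribution, and compares the median with the 99th percentile. The parameter values (`mu`, `sigma`, `nu`) and the helper name `sample_log_t` are illustrative assumptions, not values or code from the paper, and the sketch does not reproduce the paper's fitting procedure or the TIE metric.

```python
import math
import random
import statistics


def sample_log_t(rng, mu, sigma, nu):
    """Draw one sample X such that log(X) = mu + sigma * T, with T ~ Student's t(nu)."""
    z = rng.gauss(0.0, 1.0)
    # Chi-square with nu degrees of freedom is Gamma(shape=nu/2, scale=2).
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    t = z / math.sqrt(chi2 / nu)
    return math.exp(mu + sigma * t)


# Hypothetical parameters chosen only for illustration.
rng = random.Random(0)
lengths = [sample_log_t(rng, mu=5.0, sigma=0.5, nu=4.0) for _ in range(10_000)]

median = statistics.median(lengths)
p99 = statistics.quantiles(lengths, n=100)[98]  # 99th percentile cut point
print(f"median length: {median:.1f} tokens")
print(f"99th percentile: {p99:.1f} tokens")
```

With a heavy tail, the 99th percentile sits several times above the median, so a single predicted length can badly underestimate how long a request may actually run.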

Abstract

To schedule LLM inference, the shortest job first (SJF) principle is favorable by prioritizing requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a point estimate does not match the stochastic decoding process of LLM inference, where output length is uncertain by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. With an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to…
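The head-of-line blocking that SJF avoids can be illustrated with a small worked example. The sketch below is a simplification under stated assumptions: a single server that processes one request at a time, one token per time unit, and output lengths known in advance. The request lengths and the helper name `mean_completion_time` are hypothetical, and real serving systems batch requests, so this only shows the ordering effect.

```python
from itertools import accumulate


def mean_completion_time(lengths):
    """Mean completion time when jobs run one at a time in the given order."""
    finish_times = list(accumulate(lengths))
    return sum(finish_times) / len(finish_times)


# Hypothetical output lengths (tokens) in arrival order; the first request is long.
arrival_order = [900, 20, 35, 10]

fcfs = mean_completion_time(arrival_order)          # first come, first served
sjf = mean_completion_time(sorted(arrival_order))   # shortest job first

print(f"FCFS mean completion time: {fcfs}")  # 935.0
print(f"SJF mean completion time:  {sjf}")   # 267.5
```

Under first come, first served, the three short requests all wait behind the 900-token request. Sorting by length lets them finish first, which lowers the mean completion time even though the total work is unchanged.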
