LLM Query Scheduling with Prefix Reuse and Latency Constraints

Gregory Dexter; Shao Tang; Ata Fatahi Baarzi; Qingquan Song; Tejas Dharamsi; Aman Gupta

arXiv:2502.04677 · cs.DS · January 5, 2026

TL;DR

This paper introduces a formal framework and a scheduling algorithm, $k$-LPM, for large language model inference with prefix reuse. The algorithm targets latency constraints, in particular time-to-first-token (TTFT), and reduces TTFT under realistic traffic.

Contribution

The paper identifies limitations of the existing FCFS and LPM scheduling strategies in satisfying latency constraints, establishes the NP-hardness of the scheduling problem with prefix reuse, and proposes the $k$-LPM algorithm with theoretical guarantees and empirical validation.

Findings

01

$k$-LPM improves TTFT performance under realistic traffic.

02

The existing FCFS and LPM strategies have limitations in satisfying latency constraints.

03

Empirical results show significant TTFT reductions in practical settings.

Abstract

The efficient deployment of large language models (LLMs) in online settings requires optimizing inference performance under stringent latency constraints, particularly the time-to-first-token (TTFT) and time-per-output-token (TPOT). This paper focuses on the query scheduling problem for LLM inference with prefix reuse, a technique that leverages shared prefixes across queries to reduce computational overhead. Our work reveals previously unknown limitations of the existing first-come-first-serve (FCFS) and longest-prefix-match (LPM) scheduling strategies with respect to satisfying latency constraints. We present a formal theoretical framework for LLM query scheduling under RadixAttention, a prefix reuse mechanism that stores and reuses intermediate representations in a radix tree structure. Our analysis establishes the NP-hardness of the scheduling problem with prefix reuse under TTFT…
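The trade-off between FCFS and LPM that the abstract describes can be illustrated with a toy simulation. The sketch below is not the paper's implementation: the cache is a small bounded list of recently processed prompts standing in for a radix tree, the cost model simply counts prompt tokens not covered by a cached prefix, and the `schedule` function, its `k` and `capacity` parameters, and the interleaving rule (k LPM picks followed by one FCFS pick) are assumptions made for illustration, since the summary above does not define $k$-LPM.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

# A query is (arrival_time, prompt_tokens). Both fields are illustrative.
Query = Tuple[float, Tuple[int, ...]]


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cached_prefix_len(tokens: Sequence[int], cache: Deque[Tuple[int, ...]]) -> int:
    """Longest prefix of `tokens` found in any cached prompt (0 if cache is empty)."""
    return max((shared_prefix_len(tokens, c) for c in cache), default=0)


def schedule(queue: List[Query], k: int, capacity: int = 1) -> Tuple[List[float], int]:
    """Toy scheduler. ASSUMPTION: every (k+1)-th pick is FCFS, all others are LPM.

    k = 0 gives pure FCFS; k >= len(queue) gives pure LPM.
    Returns (arrival times in processing order, total uncached tokens).
    """
    pending = list(queue)
    # Bounded cache of the most recent prompts; a stand-in for a radix tree
    # with eviction, not a model of RadixAttention itself.
    cache: Deque[Tuple[int, ...]] = deque(maxlen=capacity)
    order: List[float] = []
    uncached = 0
    step = 0
    while pending:
        step += 1
        if step % (k + 1) == 0:
            # FCFS pick: earliest arrival.
            i = min(range(len(pending)), key=lambda j: pending[j][0])
        else:
            # LPM pick: longest cached prefix, ties broken by earliest arrival.
            i = max(
                range(len(pending)),
                key=lambda j: (cached_prefix_len(pending[j][1], cache), -pending[j][0]),
            )
        arrival, tokens = pending.pop(i)
        uncached += len(tokens) - cached_prefix_len(tokens, cache)
        cache.append(tokens)
        order.append(arrival)
    return order, uncached


# Hypothetical workload: two prefix families arriving interleaved.
QUEUE: List[Query] = [
    (0.0, (1, 2, 3)),
    (1.0, (9, 9)),
    (2.0, (1, 2, 4)),
    (3.0, (9, 9, 5)),
]

if __name__ == "__main__":
    for k in (0, 1, len(QUEUE)):
        print(k, schedule(QUEUE, k))
```

On this toy queue, pure FCFS (`k=0`) recomputes 11 tokens while pure LPM (`k=4`) recomputes 7 but serves the query that arrived at time 2.0 ahead of the one that arrived at 1.0. That reordering is the kind of latency risk the abstract attributes to LPM; intermediate values of `k` trade some reuse for bounded waiting in this model.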


Taxonomy

Topics: Distributed and Parallel Computing Systems · Advanced Database Systems and Queries · Advanced Data Storage Technologies