Competitive Non-Clairvoyant KV-Cache Scheduling for LLM Inference
Yiding Feng, Zonghan Yang, Yuhao Zhang

TL;DR
This paper introduces a non-clairvoyant scheduling algorithm for LLM inference with KV caches. It achieves the first constant competitive ratio without prior knowledge of request sizes and performs robustly in experiments on real request traces.
Contribution
The paper presents the Geometric Slicing Algorithm (GSA), the first non-clairvoyant policy with a proven constant competitive ratio for LLM KV-cache scheduling in the offline batch setting.
Findings
GSA achieves a competitive ratio of at most 61.92, improving to 32 in large-memory regimes.
The clairvoyant GBA algorithm achieves an approximation ratio of 10.67, a substantial improvement over previous bounds.
Numerical experiments show robust performance of the proposed algorithms on real request traces.
Abstract
Large Language Model (LLM) inference presents a unique scheduling challenge due to the Key-Value (KV) cache, where a job's memory footprint grows linearly with the number of decoded tokens. This growth couples scheduling decisions with feasibility: a scheduler must minimize latency under a hard memory budget, yet the response lengths of requests are inherently unknown. While recent works have explored this problem either assuming clairvoyance -- exact knowledge of response lengths -- or relying on machine-learned predictions, obtaining robust performance guarantees without any prior knowledge of job sizes remains a theoretically fundamental and practically important open problem. In this work, we propose the Geometric Slicing Algorithm (GSA), the first non-clairvoyant policy to achieve a constant competitive ratio for this problem in the offline batch setting. GSA manages…
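The memory model described in the abstract (a footprint that grows linearly with decoded tokens, under a hard budget) can be illustrated with a small sketch. This is not the paper's GSA or GBA algorithm. The names `Job`, `batch_fits`, and `steps_until_overflow` are hypothetical, and the sketch assumes that each token occupies one unit of memory, that the prompt is already cached when decoding starts, and that every job in the batch decodes one token per step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Assumption: the prompt's KV entries are already cached when decoding starts.
    prompt_len: int
    # Tokens decoded so far; the true response length is unknown to the scheduler.
    decoded: int = 0

    def footprint(self) -> int:
        # Assumption: one memory unit per cached token, so growth is linear.
        return self.prompt_len + self.decoded


def batch_fits(jobs: List[Job], budget: int) -> bool:
    """Check the hard memory constraint for a batch at its current state."""
    return sum(job.footprint() for job in jobs) <= budget


def steps_until_overflow(jobs: List[Job], budget: int) -> int:
    """Decode steps the whole batch can take before exceeding the budget.

    Assumes no job finishes early; each step adds one token per job.
    """
    if not jobs:
        raise ValueError("batch must contain at least one job")
    used = sum(job.footprint() for job in jobs)
    if used > budget:
        return 0
    return (budget - used) // len(jobs)
```

With two jobs of prompt lengths 4 and 6 and a budget of 20 units, the batch can take 5 joint decode steps. A sixth step would need 22 units, so a scheduler that does not know when the jobs finish has to decide in advance how to avoid that overflow.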
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
