Optimal Scheduling Algorithms for LLM Inference: Theory and Practice
Agrim Bari, Parikshit Hegde, Gustavo de Veciana

TL;DR
This paper develops a theoretical framework for routing and scheduling in Large Language Model (LLM) inference and derives two practical schedulers from it: RAD, aimed at throughput, and SLAI, aimed at latency targets.
Contribution
The paper introduces the Resource-Aware Dynamic (RAD) scheduler, which the authors prove achieves throughput optimality under conditions given in the paper, and the SLAI scheduler, which is designed to meet service-level objectives (SLOs).
Findings
SLAI reduces median time to first token (TTFT) by 53%
SLAI increases maximum serving capacity by 26%
Median TTFT achieved below 0.5 seconds
Abstract
With the growing use of Large Language Model (LLM)-based tools like ChatGPT, Perplexity, and Gemini across industries, there is a rising need for efficient LLM inference systems. These systems handle requests with a unique two-phase computation structure: a prefill-phase that processes the full input prompt and a decode-phase that autoregressively generates tokens one at a time. This structure calls for new strategies for routing and scheduling requests. In this paper, we take a comprehensive approach to this challenge by developing a theoretical framework that models routing and scheduling in LLM inference systems. We identify two key design principles, optimal tiling and dynamic resource allocation, that are essential for achieving high throughput. Guided by these principles, we propose the Resource-Aware Dynamic (RAD) scheduler and prove that it achieves throughput optimality under…
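The two-phase structure described in the abstract can be illustrated with a minimal sketch. This is not the paper's model or either of its schedulers; the `Request` class, the `phase_token_counts` function, and the assumption that the prefill pass also yields the first output token are all hypothetical simplifications chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int   # tokens processed together in the prefill phase
    output_tokens: int   # tokens generated one at a time in the decode phase


def phase_token_counts(req: Request) -> list[int]:
    """Return the number of tokens processed in each forward pass.

    Toy assumption (not from the paper): prefill handles the whole prompt
    in a single pass and also yields the first output token; every later
    pass decodes exactly one token.
    """
    if req.prompt_tokens < 1 or req.output_tokens < 1:
        raise ValueError("prompt_tokens and output_tokens must be >= 1")
    prefill = [req.prompt_tokens]
    decode = [1] * (req.output_tokens - 1)
    return prefill + decode


passes = phase_token_counts(Request(prompt_tokens=512, output_tokens=4))
print(passes)  # [512, 1, 1, 1]
```

Under these assumptions a request contributes one large pass followed by many single-token passes, which gives a rough sense of why the abstract argues that this workload calls for its own routing and scheduling strategies.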
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications
