Semantic Scheduling for LLM Inference

Wenyue Hua; Dujian Ding; Yile Gu; Yujie Ren; Kai Mei; Minghua Ma; William Yang Wang

arXiv:2506.12204·cs.LG·June 17, 2025

TL;DR

This paper proposes a semantic scheduling algorithm for large language model inference that prioritizes tasks based on their meaning and urgency, improving response times in critical scenarios like emergency management.

Contribution

It introduces a novel, optimal-time-complexity semantic scheduling algorithm for LLM requests, enabling context-aware prioritization in inference tasks.

Findings

01 Effective reduction in waiting time for urgent tasks

02 Demonstrated benefits in emergency management scenarios

03 Open-source code and data available for replication

Abstract

Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often fail to prioritize tasks that require urgent attention or carry higher importance, such as those in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for requests from large language models (LLMs), where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we…
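
To make the idea of semantics-guided priorities concrete, the following is a minimal sketch and not the paper's algorithm, which the abstract does not spell out. It assumes each request already carries a hypothetical `urgency` class (0 is most urgent, as might be assigned by a language model reading the prompt) and a hypothetical estimated service time `est_time`. Requests are served most-urgent first, and shortest first within the same urgency class, using a binary heap.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Ordering uses (urgency, est_time); both fields are illustrative assumptions.
    urgency: int                        # hypothetical semantic urgency class, 0 = most urgent
    est_time: float                     # hypothetical estimated service time in seconds
    prompt: str = field(compare=False)  # request text, ignored when comparing


def schedule(requests):
    """Return requests in service order: most urgent first, shortest first within a class."""
    heap = list(requests)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


def waiting_times(order):
    """Waiting time of each request when served one at a time in the given order."""
    waits, clock = {}, 0.0
    for req in order:
        waits[req.prompt] = clock
        clock += req.est_time
    return waits


if __name__ == "__main__":
    queue = [
        Request(2, 5.0, "summarize report"),
        Request(0, 3.0, "chest pain advice"),
        Request(2, 1.0, "translate note"),
        Request(0, 1.0, "fire evacuation"),
    ]
    for req in schedule(queue):
        print(req.urgency, req.est_time, req.prompt)
```

In this sketch the two urgent requests are served before either routine one, regardless of arrival order. Serving the shorter request first within a class is the standard shortest-job-first heuristic for reducing average waiting time on a single server; whether the paper uses the same rule is not stated in the text above.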

Code & Models

Repositories

wenyueh/latency_optimization_with_priority_constraints (Official)

Taxonomy

Topics: Natural Language Processing Techniques