Semantic Scheduling for LLM Inference
Wenyue Hua, Dujian Ding, Yile Gu, Yujie Ren, Kai Mei, Minghua Ma, William Yang Wang

TL;DR
This paper proposes a semantic scheduling algorithm for large language model inference that prioritizes tasks based on their meaning and urgency, improving response times in critical scenarios like emergency management.
Contribution
It introduces a novel, optimal-time-complexity semantic scheduling algorithm for LLM requests, enabling context-aware prioritization in inference tasks.
Findings
Effective reduction in waiting time for urgent tasks
Demonstrated benefits in emergency management scenarios
Open-source code and data available for replication
Abstract
Conventional operating system scheduling algorithms are largely content-ignorant, making decisions based on factors such as latency or fairness without considering the actual intents or semantics of processes. Consequently, these algorithms often do not prioritize tasks that require urgent attention or carry higher importance, such as in emergency management scenarios. However, recent advances in language models enable semantic analysis of processes, allowing for more intelligent and context-aware scheduling decisions. In this paper, we introduce the concept of semantic scheduling for large language model (LLM) requests, where the semantics of the process guide the scheduling priorities. We present a novel scheduling algorithm with optimal time complexity, designed to minimize the overall waiting time in LLM-based prompt scheduling. To illustrate its effectiveness, we…
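The idea of letting semantics guide scheduling priority can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a plain binary-heap priority queue, and it assumes each request already carries an urgency label (in a semantic scheduler that label would come from a language model's reading of the prompt). The names `Request`, `SemanticQueue`, `urgency`, and `service_time` are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Request:
    # Hypothetical scale: lower urgency value = served earlier.
    urgency: int
    # Arrival sequence number; breaks ties in FIFO order.
    seq: int
    prompt: str = field(compare=False)
    service_time: float = field(compare=False)


class SemanticQueue:
    """Illustrative urgency-ordered queue (assumption, not the paper's method)."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, prompt, urgency, service_time):
        # The urgency label is supplied by the caller here; a semantic
        # scheduler would derive it from the prompt's content.
        req = Request(urgency, next(self._seq), prompt, service_time)
        heapq.heappush(self._heap, req)

    def drain(self):
        """Serve all requests in priority order.

        Returns a list of (prompt, waiting_time) pairs, where waiting_time
        is the total service time of everything served before that request.
        """
        clock = 0.0
        served = []
        while self._heap:
            req = heapq.heappop(self._heap)
            served.append((req.prompt, clock))
            clock += req.service_time
        return served


q = SemanticQueue()
q.submit("summarize meeting notes", urgency=2, service_time=3.0)
q.submit("patient reports chest pain", urgency=0, service_time=1.0)
q.submit("draft a blog post", urgency=2, service_time=2.0)
order = q.drain()
```

In this toy run the urgent request is served first with zero waiting time even though it arrived second, and the two equal-urgency requests keep their arrival order. Each push and pop costs O(log n) with the heap.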
Taxonomy
Topics: Natural Language Processing Techniques
