Don't Stop Me Now: Embedding Based Scheduling for LLMs
Rana Shahout, Eran Malach, Chunwei Liu, Weifan Jiang, Minlan Yu, Michael Mitzenmacher

TL;DR
This paper introduces TRAIL, a novel method for predicting output lengths in LLMs to enable size-based scheduling, improving efficiency by balancing preemption and memory overhead in interactive applications.
Contribution
The paper presents TRAIL, a lightweight embedding-based prediction technique for LLM output lengths and a new SRPT scheduling variant optimized for memory and preemption in LLM systems.
Findings
TRAIL accurately predicts remaining output length after each token.
The proposed SRPT variant improves system throughput and reduces latency.
Theoretical analysis confirms the effectiveness of the scheduling approach.
Abstract
Efficient scheduling is crucial for interactive Large Language Model (LLM) applications, where low request completion time directly impacts user engagement. Size-based scheduling algorithms like Shortest Remaining Processing Time (SRPT) aim to reduce average request completion time by leveraging known or estimated request sizes and allowing preemption by incoming jobs with shorter service times. However, two main challenges arise when applying size-based scheduling to LLM systems. First, accurately predicting output lengths from prompts is challenging and often resource-intensive, making it impractical for many systems. As a result, the state-of-the-art LLM systems default to first-come, first-served scheduling, which can lead to head-of-line blocking and reduced system efficiency. Second, preemption introduces extra memory overhead to LLM systems as they must maintain intermediate states…
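The contrast the abstract draws between first-come, first-served and SRPT can be illustrated with a toy single-server simulation. This is a minimal sketch, not the paper's TRAIL method or its SRPT variant: it assumes request sizes (output token counts) are known exactly, one token is generated per time step, and preemption is free, so it ignores both the prediction error and the memory overhead that the paper addresses. The function names and the example workload are hypothetical.

```python
import heapq

def fcfs_mean_completion(jobs):
    """Mean completion time under first-come, first-served.

    jobs: list of (arrival_time, size) with integer sizes >= 1.
    """
    t = 0
    total = 0
    for arrival, size in sorted(jobs):
        # The server starts this job once it is free and the job has arrived.
        t = max(t, arrival) + size
        total += t - arrival
    return total / len(jobs)

def srpt_mean_completion(jobs):
    """Mean completion time under preemptive SRPT, one token per time step."""
    pending = sorted(jobs)
    heap = []  # entries are (remaining_tokens, arrival_time)
    t = 0
    i = 0
    total = 0
    while i < len(pending) or heap:
        # If the server is idle, jump ahead to the next arrival.
        if not heap and pending[i][0] > t:
            t = pending[i][0]
        # Admit every job that has arrived by time t.
        while i < len(pending) and pending[i][0] <= t:
            arrival, size = pending[i]
            heapq.heappush(heap, (size, arrival))
            i += 1
        # Run the job with the least remaining work for one step.
        remaining, arrival = heapq.heappop(heap)
        remaining -= 1
        t += 1
        if remaining == 0:
            total += t - arrival
        else:
            heapq.heappush(heap, (remaining, arrival))
    return total / len(jobs)

# Hypothetical workload: one long request followed by two short ones.
jobs = [(0, 100), (1, 5), (2, 5)]
print(fcfs_mean_completion(jobs))  # 104.0
print(srpt_mean_completion(jobs))  # about 41.33
```

In this workload the long request arrives first, so under first-come, first-served both short requests wait behind it (head-of-line blocking). Under SRPT they preempt it and finish quickly, while the long request still completes at the same time step.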
Taxonomy
Topics: Scheduling and Optimization Algorithms · Distributed and Parallel Computing Systems
