No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha
Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

TL;DR
Medha is a novel serving system for large language models that employs preemptive scheduling and parallelism strategies to handle heterogeneous workloads efficiently, significantly improving throughput and latency.
Contribution
Medha introduces a comprehensive preemptive scheduling framework with mechanisms like Adaptive Chunking, Stream Pipeline Parallel, and KV-Cache Parallelism, orchestrated by the LARS scheduler, to mitigate convoy effects in LLM inference.
Findings
5.7x throughput improvement under heterogeneous workload
30x reduction in median latency
174x reduction in 99th percentile latency
Abstract
Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects where long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallel, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, that reduces decode latency and preserves interactivity despite very long contexts. These mechanisms are orchestrated by a…
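The convoy effect described in the abstract can be illustrated with a toy simulation. The sketch below is not Medha's implementation and does not reproduce the LARS policy, which this summary does not detail. It assumes a single worker, abstract integer "work units" per request, zero preemption overhead, and a shortest-remaining-work rule as a stand-in policy. The names `Request`, `fcfs`, `chunked_preemptive`, and `latencies` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int  # time step at which the request arrives (assumed unit)
    work: int     # total work units needed to finish (assumed unit)


def fcfs(requests):
    """Run each request to completion in arrival order, with no preemption."""
    clock = 0
    finish = {}
    for r in sorted(requests, key=lambda r: r.arrival):
        clock = max(clock, r.arrival) + r.work
        finish[r.name] = clock
    return finish


def chunked_preemptive(requests, chunk=1):
    """Process work in chunks; after every chunk, switch to the arrived
    request with the least remaining work (a stand-in policy, not LARS)."""
    remaining = {r.name: r.work for r in requests}
    arrival = {r.name: r.arrival for r in requests}
    clock = 0
    finish = {}
    while remaining:
        ready = [n for n in remaining if arrival[n] <= clock]
        if not ready:
            # Nothing has arrived yet: jump to the next arrival.
            clock = min(arrival[n] for n in remaining)
            continue
        name = min(ready, key=lambda n: (remaining[n], arrival[n]))
        step = min(chunk, remaining[name])
        clock += step
        remaining[name] -= step
        if remaining[name] == 0:
            del remaining[name]
            finish[name] = clock
    return finish


def latencies(requests, finish):
    """Completion time minus arrival time for each request."""
    return {r.name: finish[r.name] - r.arrival for r in requests}


requests = [
    Request("long_doc", arrival=0, work=100),
    Request("short_a", arrival=1, work=2),
    Request("short_b", arrival=2, work=2),
]

fcfs_lat = latencies(requests, fcfs(requests))
chunk_lat = latencies(requests, chunked_preemptive(requests, chunk=1))
print("FCFS:   ", fcfs_lat)
print("Chunked:", chunk_lat)
```

In this toy setting the short requests wait behind the long one under first-come-first-served, while chunked preemption lets them finish within a few steps at a small cost to the long request. A real system also has to pay for chunking and preemption overheads, which the abstract says mechanisms such as Adaptive Chunking are designed to address; the sketch ignores those costs.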
