No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Amey Agrawal; Haoran Qiu; Junda Chen; Íñigo Goiri; Chaojie Zhang; Rayyan Shahid; Ramachandran Ramjee; Alexey Tumanov; Esha Choukse

arXiv:2409.17264·cs.LG·November 27, 2025

TL;DR

Medha is a novel serving system for large language models that employs preemptive scheduling and parallelism strategies to handle heterogeneous workloads efficiently, significantly improving throughput and latency.

Contribution

Medha introduces a comprehensive preemptive scheduling framework with mechanisms like Adaptive Chunking, Stream Pipeline Parallel, and KV-Cache Parallelism, orchestrated by the LARS scheduler, to mitigate convoy effects in LLM inference.

Findings

01. 5.7x throughput improvement under heterogeneous workloads
02. 30x reduction in median latency
03. 174x reduction in 99th percentile latency

Abstract

Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects in which long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms, including Adaptive Chunking and Stream Pipeline Parallel, that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy, KV-Cache Parallelism, to reduce decode latency and afford interactivity despite very long contexts. These mechanisms are orchestrated by a…
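
The convoy effect described in the abstract can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not Medha's scheduler: all requests arrive at time 0, work is measured in abstract units, plain round-robin with a fixed chunk size stands in for preemptive scheduling, and the quadratic cost of attention and any chunking overhead are ignored. The function names, the job names, and the chunk size of 50 are hypothetical choices made for illustration.

```python
from collections import deque


def fcfs_completion_times(jobs):
    """Run each job to completion in arrival order, with no preemption."""
    clock = 0
    done = {}
    for name, work in jobs:
        clock += work
        done[name] = clock
    return done


def chunked_completion_times(jobs, chunk):
    """Round-robin over jobs, doing at most `chunk` units of work per turn.

    A job that is not finished after its turn goes to the back of the queue,
    so short jobs get a turn before a long job can monopolize the system.
    """
    queue = deque(jobs)
    clock = 0
    done = {}
    while queue:
        name, remaining = queue.popleft()
        step = min(chunk, remaining)
        clock += step
        remaining -= step
        if remaining == 0:
            done[name] = clock
        else:
            queue.append((name, remaining))
    return done


# Hypothetical workload: one long document ahead of two short chat requests.
jobs = [("long_doc", 1000), ("chat_1", 10), ("chat_2", 10)]

fcfs = fcfs_completion_times(jobs)
chunked = chunked_completion_times(jobs, chunk=50)

print("FCFS:   ", fcfs)
print("Chunked:", chunked)
```

In this toy setting the short requests finish at times 60 and 70 with chunking, instead of 1010 and 1020 when they wait behind the long request, while the long request finishes at 1020 instead of 1000. The sketch shows only why preemption at chunk boundaries helps short requests; it says nothing about how Medha chooses chunk sizes or prioritizes requests.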
