HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai

TL;DR
HyGen is a system that efficiently co-locates online and offline large language model workloads, improving resource utilization and throughput while maintaining latency service-level objectives through interference-aware scheduling.
Contribution
HyGen is an interference-aware LLM serving system that combines performance control mechanisms (a latency predictor and an SLO-aware profiler) with SLO-aware offline scheduling to improve resource utilization and throughput.
Findings
Achieves 3.9-5.8x throughput improvements
Maintains latency SLOs effectively
Demonstrates efficiency on production workloads
Abstract
Large language models (LLMs) have facilitated a wide range of applications with distinct service-level objectives (SLOs), from latency-sensitive online tasks like interactive chatbots to throughput-oriented offline workloads like data synthesis. The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving SLOs. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor to estimate batch execution time and an SLO-aware profiler to quantify latency interference, and (2) SLO-aware offline scheduling policies that maximize serving throughput and prevent starvation. Our evaluation on production workloads shows…
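The co-location idea in the abstract can be illustrated with a minimal sketch. It is not HyGen's implementation: the linear latency model, its coefficients, the `Request` fields, and the function names are all hypothetical, chosen only to show how a predicted batch latency could gate the admission of offline work next to online requests. Starvation prevention, which the abstract mentions, is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prefill_tokens: int  # prompt tokens processed in this iteration
    decode_tokens: int   # tokens generated in this iteration


def predict_latency_us(batch, base_us=5000, per_prefill_us=20, per_decode_us=500):
    """Hypothetical linear latency model in microseconds (coefficients are made up)."""
    prefill = sum(r.prefill_tokens for r in batch)
    decode = sum(r.decode_tokens for r in batch)
    return base_us + per_prefill_us * prefill + per_decode_us * decode


def fill_with_offline(online, offline_queue, budget_us):
    """Keep all online requests, then greedily admit offline requests
    as long as the predicted batch latency stays within the budget."""
    batch = list(online)
    admitted = []
    for req in offline_queue:
        if predict_latency_us(batch + [req]) <= budget_us:
            batch.append(req)
            admitted.append(req)
    return batch, admitted
```

In this sketch the latency budget stands in for the online SLO: an offline request is skipped whenever adding it would push the predicted iteration time past the budget, so online requests are never displaced. A real system would fit the predictor to measured batch execution times rather than use fixed coefficients.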
Taxonomy
Topics: Caching and Content Delivery · Cooperative Communication and Network Coding · IoT and Edge/Fog Computing
