HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location

Ting Sun; Penghan Wang; Fan Lai

arXiv:2501.14808·cs.DC·October 31, 2025

TL;DR

HyGen is a system that efficiently co-locates online and offline large language model workloads, improving resource utilization and throughput while maintaining latency service-level objectives through interference-aware scheduling.

Contribution

HyGen is an interference-aware LLM serving system that combines performance control mechanisms with SLO-aware offline scheduling to improve resource utilization and throughput.

Findings

01 Achieves 3.9-5.8x throughput improvements

02 Maintains latency SLOs effectively

03 Demonstrates efficiency on production workloads

Abstract

Large language models (LLMs) have facilitated a wide range of applications with distinct service-level objectives (SLOs), from latency-sensitive online tasks like interactive chatbots to throughput-oriented offline workloads like data synthesis. The existing deployment model, which dedicates machines to each workload, simplifies SLO management but often leads to poor resource utilization. This paper introduces HyGen, an interference-aware LLM serving system that enables efficient co-location of online and offline workloads while preserving SLOs. HyGen incorporates two key innovations: (1) performance control mechanisms, including a latency predictor to estimate batch execution time and an SLO-aware profiler to quantify latency interference, and (2) SLO-aware offline scheduling policies that maximize serving throughput and prevent starvation. Our evaluation on production workloads shows…
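To make the two mechanisms in the abstract more concrete, here is a minimal sketch of how a latency predictor could gate the admission of offline requests into a batch that already holds online requests. It is an illustration under stated assumptions, not HyGen's implementation: the linear latency model and its coefficients, the `Request` fields, the `latency_budget_ms` parameter, and the oldest-first ordering used as a simple guard against starvation are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: str
    tokens: int            # tokens this request would add to the next batch
    waited_steps: int = 0  # scheduling rounds an offline request has waited


def predict_latency_ms(total_tokens: int, num_requests: int) -> float:
    """Hypothetical linear latency model; the coefficients are made up.

    A real predictor would be fitted to profiled batch execution times.
    """
    return 5.0 + 0.02 * total_tokens + 0.1 * num_requests


def build_batch(
    online: List[Request],
    offline: List[Request],
    latency_budget_ms: float,
) -> Tuple[List[Request], List[Request]]:
    """Admit offline requests only while the predicted latency fits the budget.

    Online requests are always scheduled. Offline requests are considered
    oldest-first (an assumed anti-starvation heuristic) and deferred when
    adding them would push the predicted batch latency over the budget.
    """
    batch = list(online)
    total_tokens = sum(r.tokens for r in batch)
    deferred: List[Request] = []

    for req in sorted(offline, key=lambda r: r.waited_steps, reverse=True):
        predicted = predict_latency_ms(total_tokens + req.tokens, len(batch) + 1)
        if predicted <= latency_budget_ms:
            batch.append(req)
            total_tokens += req.tokens
        else:
            req.waited_steps += 1  # remember the wait so it is tried earlier next round
            deferred.append(req)

    return batch, deferred
```

Calling `build_batch(online, offline, latency_budget_ms=25.0)` returns the requests to run now and the offline requests to retry later. The budget stands in for whatever slack the online SLO leaves after accounting for interference.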

Taxonomy

Topics: Caching and Content Delivery · Cooperative Communication and Network Coding · IoT and Edge/Fog Computing