Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking
Zizhao Mo, Junlin Chen, Huanle Xu, Chengzhong Xu

TL;DR
OmniServe is a novel LLM serving system that uses CPU-GPU attention piggybacking and dynamic batching to reduce interference, improve latency guarantees, and significantly boost throughput for shared cluster deployments.
Contribution
The paper introduces OmniServe, which leverages CPU-GPU attention piggybacking and adaptive batching to enhance resource utilization and SLO compliance in multi-tenant LLM serving environments.
Findings
Up to 1.48x improvement in SLO attainment rate for latency-sensitive services.
Up to 9.85x increase in best-effort service throughput.
Effective mitigation of interference in shared LLM serving clusters.
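For readers unfamiliar with the metric behind the first finding, the sketch below shows one common way to compute an SLO attainment rate: the fraction of requests whose latency falls within the SLO target. This summary does not give the paper's exact definition, so the function name `slo_attainment_rate`, the 100 ms target, and all latency values are illustrative assumptions, not data from the paper.

```python
from typing import Sequence


def slo_attainment_rate(latencies_ms: Sequence[float], slo_ms: float) -> float:
    """Return the fraction of requests whose latency is within the SLO target.

    Hypothetical helper for illustration; the paper may define the metric
    differently (for example, per token rather than per request).
    """
    if not latencies_ms:
        raise ValueError("need at least one request latency")
    met = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return met / len(latencies_ms)


# Made-up latencies (ms) for a latency-sensitive service, before and after
# some interference mitigation is applied. Not measurements from the paper.
baseline_latencies = [80.0, 120.0, 95.0, 210.0, 60.0]
mitigated_latencies = [70.0, 90.0, 85.0, 110.0, 55.0]
slo_target_ms = 100.0  # assumed SLO target

baseline_rate = slo_attainment_rate(baseline_latencies, slo_target_ms)
mitigated_rate = slo_attainment_rate(mitigated_latencies, slo_target_ms)

# A relative improvement such as "1.48x" is the ratio of two such rates.
relative_improvement = mitigated_rate / baseline_rate
print(f"baseline={baseline_rate:.2f} mitigated={mitigated_rate:.2f} "
      f"improvement={relative_improvement:.2f}x")
```

Under this reading, an "up to 1.48x" improvement is the ratio of the attainment rate with OmniServe to the attainment rate of a baseline system under the same load.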
Abstract
Service providers now often deploy multiple types of LLM services within shared clusters. While service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically reserve headroom to constrain BE resource usage. However, the coarse granularity of this approach compromises the SLO compliance of the LS service and unnecessarily restricts the generation potential of the BE service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the…
Taxonomy
Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
