Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

Zizhao Mo; Junlin Chen; Huanle Xu; Chengzhong Xu

arXiv:2603.12831·cs.DC·March 18, 2026

Open Access

TL;DR

OmniServe is a novel LLM serving system that uses CPU-GPU attention piggybacking and dynamic batching to reduce interference, improve latency guarantees, and significantly boost throughput for shared cluster deployments.

Contribution

The paper introduces OmniServe, which leverages CPU-GPU attention piggybacking and adaptive batching to enhance resource utilization and SLO compliance in multi-tenant LLM serving environments.

Findings

1. Up to 1.48x improvement in SLO attainment rate for latency-sensitive services.

2. Up to 9.85x increase in best-effort service throughput.

3. Effective mitigation of interference in shared LLM serving clusters.
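
For readers unfamiliar with the metric, the sketch below shows one common way to compute an SLO attainment rate and a relative improvement between two systems. It assumes the rate is the fraction of requests whose latency falls within the SLO; the paper may define it differently. The function name `slo_attainment_rate`, the 100 ms SLO, and all latency values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def slo_attainment_rate(latencies_ms: Sequence[float], slo_ms: float) -> float:
    """Fraction of requests whose latency is within the SLO (hypothetical helper)."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    met = sum(1 for latency in latencies_ms if latency <= slo_ms)
    return met / len(latencies_ms)


# Made-up per-request latencies in milliseconds, for illustration only.
baseline_latencies = [80.0, 120.0, 95.0, 150.0]
candidate_latencies = [80.0, 90.0, 95.0, 99.0]

baseline_rate = slo_attainment_rate(baseline_latencies, slo_ms=100.0)
candidate_rate = slo_attainment_rate(candidate_latencies, slo_ms=100.0)

# An "Nx improvement" is read here as the ratio of the two rates (an assumption).
improvement = candidate_rate / baseline_rate
print(f"baseline={baseline_rate:.2f} candidate={candidate_rate:.2f} improvement={improvement:.2f}x")
```

With these made-up numbers, two of four baseline requests and all four candidate requests meet the SLO, so the ratio is 2.0x.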

Abstract

Service providers now often deploy multiple types of LLM services within shared clusters. While service colocation improves resource utilization, it introduces significant interference risks for latency-sensitive (LS) services, which have strict SLO requirements for inference latency, and severely constrains the service capacity of best-effort (BE) services due to limited available memory. To address interference, existing systems typically rely on reserving headroom to constrain BE resource usage. However, this approach's coarse granularity compromises the SLO compliance of the LS service and unnecessarily restricts the generation potential of the BE service. In this paper, we propose OmniServe, a novel LLM serving system that efficiently harnesses both CPU and GPU resources to mitigate interference and improve throughput. Central to OmniServe is the…

Taxonomy

Topics: Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies