Harli: SLO-Aware Co-location of LLM Inference and PEFT-based Finetuning on Model-as-a-Service Platforms

Ao Xu; Han Zhao; Weihao Cui; Quan Chen; Yukang Chen; Shulai Zhang; Shuang Chen; Jiemin Jiang; Zhibin Yu; Minyi Guo

arXiv:2511.11729·cs.DC·November 20, 2025

TL;DR

Harli is an LLM serving system that improves GPU utilization by co-locating PEFT finetuning tasks with decode instances, raising finetune throughput by 46.2% on average while preserving QoS for inference decode.

Contribution

Harli co-locates PEFT finetuning with LLM decode instances, combining a unified memory allocator, a two-stage latency predictor, and a QoS scheduler to raise GPU utilization on LLM MaaS platforms.

Findings

01 Increases finetune throughput by 46.2% on average

02 Maintains strict QoS guarantees for inference decode

03 Outperforms state-of-the-art systems significantly

Abstract

Large language models (LLMs) are increasingly deployed under the Model-as-a-Service (MaaS) paradigm. To meet stringent quality-of-service (QoS) requirements, existing LLM serving systems disaggregate the prefill and decode phases of inference. However, decode instances often experience low GPU utilization due to their memory-bound nature and insufficient batching in dynamic workloads, leaving compute resources underutilized. We introduce Harli, a serving system that improves GPU utilization by co-locating parameter-efficient finetuning (PEFT) tasks with LLM decode instances. PEFT tasks are compute-bound and memory-efficient, making them ideal candidates for safe co-location. Specifically, Harli addresses key challenges--limited memory and unpredictable interference--using three components: a unified memory allocator for runtime memory reuse, a two-stage latency predictor for decode…
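The idea of admitting finetuning work only while decode stays within its latency target can be illustrated with a small sketch. This is not Harli's implementation: the linear interference model, the `INTERFERENCE_SLOPE` constant, the notion of a "finetune share", and every function and class name below are hypothetical, chosen only to show how a latency predictor might feed a QoS-aware co-location decision.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DecodeState:
    """Snapshot of a decode instance (hypothetical fields)."""
    batch_size: int          # requests currently in the decode batch
    solo_latency_ms: float   # per-step decode latency with no co-located work


# Assumed constant: how strongly co-located finetuning slows decode.
# A real system would learn this from profiling; the value is made up.
INTERFERENCE_SLOPE = 0.8


def predict_decode_latency_ms(state: DecodeState, finetune_share: float) -> float:
    """Toy predictor: decode latency grows linearly with the fraction of
    GPU compute handed to the finetuning task (an assumption, not the
    paper's two-stage predictor)."""
    return state.solo_latency_ms * (1.0 + INTERFERENCE_SLOPE * finetune_share)


def pick_finetune_share(
    state: DecodeState,
    slo_ms: float,
    candidates: Iterable[float] = (0.0, 0.25, 0.5, 0.75),
) -> float:
    """Return the largest candidate share whose predicted decode latency
    still meets the SLO; fall back to 0.0 (no co-location) otherwise."""
    best = 0.0
    for share in sorted(candidates):
        if predict_decode_latency_ms(state, share) <= slo_ms:
            best = share
    return best


if __name__ == "__main__":
    state = DecodeState(batch_size=8, solo_latency_ms=20.0)
    print(pick_finetune_share(state, slo_ms=30.0))
```

In this toy setting, a lightly loaded decode instance leaves headroom under its SLO, so the scheduler grants finetuning a larger share; when decode alone already exceeds the target, the share drops to zero.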

Taxonomy

Topics: Software System Performance and Reliability · Big Data and Digital Economy · Scientific Computing and Data Management