CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

Shaoyuan Huang; Yunfeng Zhao; Na Yan; Tiancheng Zhang; Xiaokai Wang; Xiaofei Wang; Wenyu Wang; Yansha Deng

arXiv:2604.16400 · cs.DC · May 19, 2026

TL;DR

CoLLM is a co-execution framework that runs federated parameter-efficient fine-tuning and inference together on shared GPU replicas, so that LLMs serving edge applications can adapt continuously while improving both inference efficiency and model quality.

Contribution

The paper presents CoLLM, a novel system that unifies federated parameter-efficient fine-tuning and inference through a co-execution framework for shared GPU clusters.

Findings

01. Up to 3x higher goodput compared to state-of-the-art systems.

02. Effective real-time model parameter reuse via intra-replica sharing.

03. Adaptive workload balancing improves long-term model quality and inference efficiency.

Abstract

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical under constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads. This overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1)…
