MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing

Zhaoyuan Su; Zeyu Zhang; Tingfeng Lan; Zirui Wang; Haiying Shen; Juncheng Yang; Yue Cheng

arXiv:2506.02006 · cs.DC · January 8, 2026

TL;DR

MorphServe is a dynamic LLM serving framework that swaps in quantized layers and resizes the KV cache at runtime, significantly reducing SLO violations and latency under bursty workloads without sacrificing accuracy.

Contribution

It introduces two asynchronous, token-level runtime mechanisms, quantized layer swapping and pressure-aware KV cache resizing, that adapt LLM serving to the current workload and enable efficient, elastic deployment in dynamic environments.

Findings

1. Reduces average SLO violations by 92.45%

2. Improves P95 time-to-first-token (TTFT) latency by 2.2x-3.9x

3. Maintains generation quality with workload adaptation

Abstract

Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and…
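To make the two mechanisms in the abstract concrete, here is a minimal sketch of a pressure-driven adaptation step. It is not the authors' implementation: the `Layer` and `ServerState` types, the per-layer `sensitivity` score, the byte sizes, and the `high`/`low` thresholds are all hypothetical, and the step runs synchronously, whereas the abstract describes asynchronous, token-level mechanisms. It only illustrates the general idea of trading layer precision for KV cache capacity when memory pressure rises, and reversing the trade when pressure falls.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    index: int
    sensitivity: float  # hypothetical score: higher means quantizing hurts accuracy more
    fp16_bytes: int     # assumed memory footprint at full precision
    int4_bytes: int     # assumed memory footprint when quantized
    quantized: bool = False


@dataclass
class ServerState:
    layers: list
    kv_capacity_bytes: int
    kv_used_bytes: int

    @property
    def pressure(self) -> float:
        # Fraction of the KV cache currently in use.
        return self.kv_used_bytes / self.kv_capacity_bytes


def adapt(state: ServerState, high: float = 0.9, low: float = 0.5):
    """Run one adaptation step and return (action, layer_index)."""
    if state.pressure > high:
        # High load: quantize the least sensitive full-precision layer
        # and hand the freed memory to the KV cache.
        candidates = [l for l in state.layers if not l.quantized]
        if candidates:
            layer = min(candidates, key=lambda l: l.sensitivity)
            layer.quantized = True
            state.kv_capacity_bytes += layer.fp16_bytes - layer.int4_bytes
            return ("swap_in_quantized", layer.index)
    elif state.pressure < low:
        # Low load: restore the most sensitive quantized layer, but only
        # if the live KV entries still fit in the smaller cache.
        candidates = [l for l in state.layers if l.quantized]
        if candidates:
            layer = max(candidates, key=lambda l: l.sensitivity)
            reclaimed = layer.fp16_bytes - layer.int4_bytes
            if state.kv_used_bytes <= state.kv_capacity_bytes - reclaimed:
                layer.quantized = False
                state.kv_capacity_bytes -= reclaimed
                return ("restore_full_precision", layer.index)
    return ("no_op", None)
```

The gap between the two thresholds is a deliberate hysteresis band in this sketch, so a load level hovering near a single cutoff does not cause repeated swapping. Whether MorphServe uses such a band is not stated in the abstract.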

Taxonomy

Topics: Caching and Content Delivery · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies

Methods: Softmax · Attention Is All You Need · LLaMA