MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Zhaoyuan Su, Zeyu Zhang, Tingfeng Lan, Zirui Wang, Haiying Shen, Juncheng Yang, Yue Cheng

TL;DR
MorphServe is a dynamic LLM serving framework that swaps full-precision layers for quantized ones and resizes the KV cache at runtime, significantly reducing SLO violations and latency under bursty workloads while maintaining generation quality.
Contribution
It introduces two asynchronous, token-level runtime mechanisms, quantized layer swapping and pressure-aware KV cache resizing, that let an LLM serving system adapt to workload fluctuations and deploy elastically in dynamic environments.
Findings
Reduces average SLO violations by 92.45%
Improves P95 TTFT latency by 2.2x-3.9x
Maintains generation quality while adapting to the workload
Abstract
Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and…
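The interplay between the two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the `Layer` and `ServerState` classes, the per-layer `sensitivity` score, the byte sizes, and the `high`/`low` pressure thresholds are all hypothetical names and values chosen for illustration, and the sketch is synchronous, whereas the paper describes asynchronous, token-level transitions. It only shows the general idea that quantizing a low-impact layer under memory pressure frees bytes that can be handed to the KV cache, and that the swap can be reversed once pressure subsides.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    index: int
    sensitivity: float  # hypothetical: estimated quality impact of quantizing this layer
    full_bytes: int     # hypothetical: memory footprint at full precision
    quant_bytes: int    # hypothetical: memory footprint when quantized
    quantized: bool = False


@dataclass
class ServerState:
    layers: list
    kv_cache_bytes: int  # current KV cache capacity
    kv_used_bytes: int   # bytes currently occupied by cached keys/values

    def kv_pressure(self) -> float:
        return self.kv_used_bytes / self.kv_cache_bytes


def adapt(state: ServerState, high: float = 0.9, low: float = 0.5) -> str:
    """Run one adaptation step and report what changed (illustrative only)."""
    pressure = state.kv_pressure()
    if pressure >= high:
        # Under pressure: quantize the least sensitive full-precision layer
        # and give the freed memory to the KV cache.
        candidates = [l for l in state.layers if not l.quantized]
        if candidates:
            layer = min(candidates, key=lambda l: l.sensitivity)
            layer.quantized = True
            state.kv_cache_bytes += layer.full_bytes - layer.quant_bytes
            return f"quantized layer {layer.index}"
    elif pressure <= low:
        # Pressure has subsided: restore the most sensitive quantized layer,
        # but only if the smaller cache would still stay below the high mark.
        candidates = [l for l in state.layers if l.quantized]
        for layer in sorted(candidates, key=lambda l: l.sensitivity, reverse=True):
            freed = layer.full_bytes - layer.quant_bytes
            new_capacity = state.kv_cache_bytes - freed
            if state.kv_used_bytes / new_capacity < high:
                layer.quantized = False
                state.kv_cache_bytes = new_capacity
                return f"restored layer {layer.index}"
    return "no change"


def demo() -> list:
    layers = [Layer(0, 0.9, 100, 25), Layer(1, 0.1, 100, 25), Layer(2, 0.5, 100, 25)]
    state = ServerState(layers, kv_cache_bytes=1000, kv_used_bytes=950)
    events = [adapt(state)]    # burst: pressure is 0.95
    events.append(adapt(state))  # cache has grown, pressure is back under 0.9
    state.kv_used_bytes = 300  # load drops
    events.append(adapt(state))
    return events
```

Calling `demo()` walks through a burst followed by a quiet period: the least sensitive layer is quantized first, nothing changes while pressure sits between the two thresholds, and the layer is restored once usage falls. Using two thresholds rather than one is a design choice in this sketch to avoid swapping back and forth on small fluctuations.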
Taxonomy
Topics: Caching and Content Delivery · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
Methods: Softmax · Attention Is All You Need · LLaMA
