PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving

Xu Bai; Muhammed Tawfiqul Islam; Chen Wang; Adel N. Toosi

arXiv:2604.12171 · cs.DC · April 15, 2026

TL;DR

PipeLive enables live, in-place reconfiguration of pipeline parallelism for large language model serving, cutting reconfiguration overhead from seconds to under 10 ms and improving inference latency when the serving environment changes.

Contribution

PipeLive introduces a KV cache layout, live KV resizing, and incremental KV patching, which together allow pipeline reconfiguration without interrupting inference.

Findings

01

2.5× reduction in time-to-first-token (TTFT) compared to static configurations.

02

Reconfiguration overhead reduced from seconds to under 10 ms.

03

Improved TTFT and time-per-output-token (TPOT) by up to 54.7% and 14.7%, respectively.

Abstract

Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background…
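
To make the reconfiguration problem concrete, the following is a minimal sketch of what changing a PP configuration involves at the layer level. It is not PipeLive's algorithm: the helper names `partition_layers` and `layers_to_move`, and the idea of sizing stages by per-GPU weights, are assumptions made purely for illustration. The sketch splits a model's layers into contiguous stages and then lists which layers change owner when the split changes.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, weights: List[float]) -> List[range]:
    """Split layers 0..num_layers-1 into contiguous pipeline stages.

    Stage sizes are roughly proportional to `weights` (a hypothetical
    per-GPU capacity score; not something defined in the paper).
    """
    total = sum(weights)
    bounds = [0]
    acc = 0.0
    for w in weights:
        acc += w
        bounds.append(round(num_layers * acc / total))
    bounds[-1] = num_layers  # guard against rounding drift on the last stage
    return [range(bounds[i], bounds[i + 1]) for i in range(len(weights))]


def layers_to_move(old: List[range], new: List[range]) -> List[Tuple[int, int, int]]:
    """Return (layer, source_stage, destination_stage) for every layer
    whose owning stage differs between the old and new partitions."""
    owner_old = {layer: stage for stage, r in enumerate(old) for layer in r}
    moves = []
    for stage, r in enumerate(new):
        for layer in r:
            if owner_old[layer] != stage:
                moves.append((layer, owner_old[layer], stage))
    return moves


if __name__ == "__main__":
    old = partition_layers(32, [1, 1, 1, 1])   # even split across 4 GPUs
    new = partition_layers(32, [1, 1, 1, 3])   # last GPU takes a larger share
    for layer, src, dst in layers_to_move(old, new):
        print(f"layer {layer}: stage {src} -> stage {dst}")
```

Each layer that moves brings its weights and its per-layer KV cache with it, so the destination GPU needs free memory for both. That is the pressure the abstract describes: GPUs are already full, and the KV cache is preallocated.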
