PipeLive: Efficient Live In-place Pipeline Parallelism Reconfiguration for Dynamic LLM Serving
Xu Bai, Muhammed Tawfiqul Islam, Chen Wang, Adel N. Toosi

TL;DR
PipeLive reconfigures pipeline parallelism for large language models live and in place, cutting reconfiguration downtime and improving inference latency when the serving environment changes.
Contribution
PipeLive introduces a novel KV cache layout, live KV resizing, and incremental KV patching, which together allow the pipeline to be reconfigured without interrupting inference.
Findings
2.5× reduction in time-to-first-token (TTFT) compared to static configurations.
Reconfiguration overhead reduced from seconds to under 10 ms.
TTFT and time-per-output-token (TPOT) improved by up to 54.7% and 14.7%, respectively.
Abstract
Pipeline parallelism (PP) is widely used to partition layers of large language models (LLMs) across GPUs, enabling scalable inference for large models. However, existing systems rely on static PP configurations that fail to adapt to dynamic settings, such as serverless platforms and heterogeneous GPU environments. Reconfiguring PP by stopping and redeploying service incurs prohibitive downtime, so reconfiguration must instead proceed live and in place, without interrupting inference. However, live in-place PP reconfiguration is fundamentally challenging. GPUs are already saturated with model weights and KV cache, leaving little room for new layer placements and necessitating KV cache resizing, at odds with systems like vLLM that preallocate for throughput. Moreover, maintaining KV consistency during execution is difficult: stop-and-copy introduces large pauses, while background…
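To make the reconfiguration problem concrete, the following minimal sketch shows how a PP configuration maps contiguous blocks of layers to stages (GPUs), and which layers change stage when the configuration changes. It is an illustration only, not PipeLive's implementation: the function names `partition_layers` and `layers_to_move` and the 8-layer, 2-stage setup are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_layers(num_layers: int, layers_per_stage: List[int]) -> Dict[int, int]:
    """Map each layer index to a stage (GPU) index using contiguous blocks.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if sum(layers_per_stage) != num_layers:
        raise ValueError("layers_per_stage must sum to num_layers")
    placement: Dict[int, int] = {}
    layer = 0
    for stage, count in enumerate(layers_per_stage):
        for _ in range(count):
            placement[layer] = stage
            layer += 1
    return placement


def layers_to_move(old: Dict[int, int], new: Dict[int, int]) -> Dict[int, Tuple[int, int]]:
    """Return {layer: (old_stage, new_stage)} for every layer whose stage changes."""
    return {layer: (old[layer], new[layer]) for layer in old if old[layer] != new[layer]}


# Assumed example: 8 layers on 2 GPUs, rebalanced from a 4/4 split to a 3/5 split.
old = partition_layers(8, [4, 4])
new = partition_layers(8, [3, 5])
moves = layers_to_move(old, new)
print(moves)  # {3: (0, 1)}
```

In this example only layer 3 moves, from stage 0 to stage 1. The destination GPU must therefore find room for that layer's weights and its KV cache while it keeps serving requests, which is the memory pressure the abstract describes.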
