FLYING SERVING: On-the-Fly Parallelism Switching for Large Language Model Serving
Shouwei Gao, Junqi Yin, Feiyi Wang, Wenqian Dong

TL;DR
Flying Serving is a vLLM-based system that switches between data parallelism (DP) and tensor parallelism (TP) online, without restarting engine workers, so an LLM serving deployment can adapt to changing load and mixed request requirements.
Contribution
It presents a novel approach for online DP-TP switching in LLM serving, including virtualization techniques for state management and a deadlock-free scheduler.
Findings
Up to 4.79× performance improvement under high load
Supports latency- and memory-driven requests effectively
Enables seamless reconfiguration without restarting workers
Abstract
Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request…
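The "zero-copy" idea in item (i) can be illustrated with a minimal sketch. It is not the paper's Model Weights Manager, which works on GPU tensors inside vLLM; it only shows, in standard-library Python, how one resident weight buffer can be exposed either whole (a DP replica) or as per-rank slices (TP shards) without moving any data. The names `make_weight_buffer` and `tp_shard_view`, the row-wise sharding scheme, and the toy sizes are all assumptions made for illustration.

```python
from array import array


def make_weight_buffer(rows: int, cols: int) -> array:
    """Hypothetical stand-in for a weight matrix stored row-major in one flat buffer."""
    return array("f", (float(i) for i in range(rows * cols)))


def tp_shard_view(weights: array, rows: int, cols: int,
                  tp_size: int, rank: int) -> memoryview:
    """Return a view of the row block owned by `rank`; no elements are copied."""
    if rows % tp_size != 0:
        raise ValueError("rows must divide evenly across TP ranks")
    if not 0 <= rank < tp_size:
        raise ValueError("rank out of range")
    rows_per_rank = rows // tp_size
    start = rank * rows_per_rank * cols
    stop = start + rows_per_rank * cols
    # Slicing a memoryview shares the underlying buffer instead of copying it.
    return memoryview(weights)[start:stop]


weights = make_weight_buffer(rows=4, cols=2)

# DP-style access: a single rank sees the whole matrix.
dp_view = tp_shard_view(weights, 4, 2, tp_size=1, rank=0)

# TP-style access: two ranks each see half of the rows of the same buffer.
tp_views = [tp_shard_view(weights, 4, 2, tp_size=2, rank=r) for r in range(2)]
```

Because every view aliases the same buffer, switching from the DP view to the TP views in this toy setting only changes which slice each rank reads; a write to `weights` is visible through every view that covers that element.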
Taxonomy
Topics: Software System Performance and Reliability · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
