LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism
Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin

TL;DR
LoongServe introduces elastic sequence parallelism to adaptively optimize resource utilization in serving long-context large language models, significantly boosting throughput and efficiency across variable request phases.
Contribution
It proposes elastic sequence parallelism (ESP) and implements LoongServe, a system that dynamically adjusts parallelism, reduces communication overhead, and enhances GPU memory efficiency for LLM serving.
Findings
Up to 3.85× higher maximum throughput than chunked prefill.
Up to 5.81× higher maximum throughput than prefill-decoding disaggregation.
Enhanced resource utilization and efficiency in LLM serving.
Abstract
The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3)…
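The idea of adjusting the degree of parallelism per request and per phase can be illustrated with a toy sketch. This is not LoongServe's scheduler: the function name `choose_dop`, the two capacity thresholds, and the selection rule itself are assumptions made only for illustration. It shows why a long prefill might be spread over many instances while the decoding phase of the same request could scale down to fewer.

```python
from math import ceil


def choose_dop(phase, context_tokens, max_instances,
               kv_tokens_per_instance=100_000,
               prefill_tokens_per_instance=25_000):
    """Pick a hypothetical degree of parallelism (number of instances).

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if phase == "prefill":
        # Prefill processes the whole prompt at once, so spread the
        # input tokens across more instances.
        wanted = ceil(context_tokens / prefill_tokens_per_instance)
    elif phase == "decode":
        # Decoding emits one token per step, so keep only as many
        # instances as are needed to hold the key-value cache.
        wanted = ceil(context_tokens / kv_tokens_per_instance)
    else:
        raise ValueError(f"unknown phase: {phase!r}")
    # Clamp to the range [1, max_instances].
    return max(1, min(max_instances, wanted))


# The same 200k-token request in its two phases, with 8 instances available.
prefill_dop = choose_dop("prefill", 200_000, max_instances=8)
decode_dop = choose_dop("decode", 200_000, max_instances=8)
print(prefill_dop, decode_dop)
```

Under these assumed thresholds, the request would use all 8 instances during prefill and shrink to 2 for decoding, while a short request would stay on a single instance in both phases. A static strategy would have to fix one of these choices for every request and phase.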
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
Methods: Hierarchical Feature Fusion · Dilated Convolution · Pointwise Convolution · Fragmentation · Efficient Spatial Pyramid
