LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism

Bingyang Wu; Shengyu Liu; Yinmin Zhong; Peng Sun; Xuanzhe Liu; Xin Jin

arXiv:2404.09526 · cs.DC · October 30, 2024 · 2 cites

TL;DR

LoongServe introduces elastic sequence parallelism (ESP), which adapts the degree of parallelism to each request and to each phase of a request, to serve long-context large language models with higher throughput and better resource utilization.

Contribution

It proposes elastic sequence parallelism (ESP) and implements LoongServe, a system that dynamically adjusts parallelism, reduces communication overhead, and enhances GPU memory efficiency for LLM serving.

Findings

01

Up to 3.85× higher maximum throughput than chunked prefill.

02

Up to 5.81× higher maximum throughput than prefill-decoding disaggregation.

03

More efficient use of the underlying resources when serving variable-length requests in different phases.

Abstract

The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real-time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3)…
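To make the idea of elastically adjusting the degree of parallelism more concrete, here is a minimal sketch in Python. It is not LoongServe's scheduler: the function `choose_dop`, the phase names, and the per-instance token budgets are all hypothetical, chosen only to illustrate that a long request might use many instances during its compute-heavy prefill phase and fewer during decoding.

```python
import math


def choose_dop(
    num_tokens: int,
    phase: str,
    max_instances: int,
    prefill_tokens_per_instance: int = 32_000,   # hypothetical compute budget
    decode_tokens_per_instance: int = 128_000,   # hypothetical KV-cache budget
) -> int:
    """Pick an illustrative degree of parallelism (DoP) for one request.

    Assumption: prefill cost grows with sequence length, so it is spread
    over more instances; decoding only needs enough instances to hold the
    key-value cache. Both budgets are made-up numbers for illustration.
    """
    if phase == "prefill":
        per_instance = prefill_tokens_per_instance
    elif phase == "decode":
        per_instance = decode_tokens_per_instance
    else:
        raise ValueError(f"unknown phase: {phase!r}")

    # Number of instances needed to stay within the per-instance budget.
    needed = math.ceil(num_tokens / per_instance)
    # Always use at least one instance and never more than are available.
    return max(1, min(max_instances, needed))


if __name__ == "__main__":
    for tokens in (1_000, 200_000, 1_000_000):
        prefill = choose_dop(tokens, "prefill", max_instances=8)
        decode = choose_dop(tokens, "decode", max_instances=8)
        print(f"{tokens:>9} tokens: prefill DoP={prefill}, decode DoP={decode}")
```

Under these assumed budgets, a 200,000-token request would scale up for prefill and scale back down for decoding, while a short request stays on a single instance in both phases. A static parallelism strategy would have to pick one degree for all of these cases.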

Code & Models

Repositories

LoongServe/LoongServe
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques

Methods: Hierarchical Feature Fusion · Dilated Convolution · Pointwise Convolution · Fragmentation · Efficient Spatial Pyramid