Optimizing Long-context LLM Serving via Fine-grained Sequence Parallelism

Cong Li; Yuzhe Yang; Xuegui Zheng; Qifan Yang; Yijin Guan; Size Zheng; Li-Wen Chang; Shufan Liu; Xin Liu; Guangyu Sun

arXiv:2511.06247·cs.DC·November 20, 2025

TL;DR

This paper introduces CDSP, a fine-grained sequence parallelism strategy, and Tetris, an LLM serving system built on it. By applying stage-specific parallelism, Tetris lowers latency and raises request capacity for requests of varying length.

Contribution

The paper proposes Chunkwise Dynamic Sequence Parallelism (CDSP) and Tetris, an LLM serving system based on it. Together they make LLM serving more flexible and resource-efficient by regulating parallelism dynamically and by exploiting the resource fragments that arise from SP size variation.

Findings

01 Up to 4.35× lower time-to-first-token (TTFT)

02 Reduces median time-between-tokens (TBT) by 40.1%

03 Increases max request capacity by 45%

Abstract

With the advancement of large language models (LLMs), their context windows have rapidly expanded. To meet diverse demands from varying-length requests in online services, existing state-of-the-art systems tune the sequence parallelism (SP) allocation. However, current dynamic SP allocation lacks the flexibility to (1) support stage-specific parallelism requirements in LLM inference, (2) mitigate the global latency degradation caused by excessive SP allocation, and (3) exploit resource fragments arising from SP size variation. To address these limitations, we propose Chunkwise Dynamic Sequence Parallelism (CDSP), a fine-grained parallelism strategy that assigns SP sizes across intra-request token segments. Based on CDSP, we build Tetris, an LLM serving system that (1) efficiently integrates CDSP into a disaggregated cluster to satisfy parallelism heterogeneity, (2) dynamically regulates SP…
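The idea of assigning SP sizes to token segments within a single request can be pictured with a small sketch. This is an illustration only, not the paper's scheduler: the names `Chunk`, `plan_chunks`, and `growing_sp`, the fixed chunk size, and the doubling policy are all assumptions made for the example. The abstract does not say how Tetris actually picks chunk boundaries or SP sizes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    start: int    # first token index in the chunk (inclusive)
    end: int      # last token index in the chunk (exclusive)
    sp_size: int  # number of devices assigned to this chunk


def plan_chunks(num_tokens: int, chunk_size: int,
                sp_for_chunk: Callable[[int, int, int], int]) -> List[Chunk]:
    """Split one request into fixed-size chunks and pick an SP size per chunk."""
    chunks = []
    for idx, start in enumerate(range(0, num_tokens, chunk_size)):
        end = min(start + chunk_size, num_tokens)
        chunks.append(Chunk(start, end, sp_for_chunk(idx, start, end)))
    return chunks


def growing_sp(idx: int, start: int, end: int, max_sp: int = 8) -> int:
    # Hypothetical policy: double the SP size for each later chunk, capped at
    # max_sp, since later chunks attend to a longer prefix. Not from the paper.
    return min(max_sp, 1 << idx)


plan = plan_chunks(num_tokens=10_000, chunk_size=4_096, sp_for_chunk=growing_sp)
for c in plan:
    print(f"tokens [{c.start}, {c.end}) -> SP size {c.sp_size}")
```

The point of the sketch is the granularity: a request-level scheme would assign one SP size to all 10,000 tokens, whereas a chunkwise scheme can give each segment its own size. Any real policy would also need to account for cluster load and the resource fragments the abstract mentions.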

Taxonomy

Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Big Data and Digital Economy