Context Parallelism for Scalable Million-Token Inference

Amy Yang; Jingyi Yang; Aya Ibrahim; Xinfeng Xie; Bangsheng Tang; Grigory Sizov; Jeremy Reizenstein; Jongsoo Park; Jianyu Huang

arXiv:2411.01783·cs.DC·April 22, 2025

TL;DR

This paper introduces context parallelism for long-context large language model inference, enabling near-linear scaling and efficient prefill times on multi-GPU setups, with novel attention variants for various use cases.

Contribution

The paper proposes a novel context parallelism method and two lossless ring attention variants that significantly improve long-context inference scalability and efficiency.

Findings

01. Achieves 1M context prefill in 77s on Llama3 405B
02. Demonstrates near-linear scaling with up to 128 GPUs
03. Develops lossless ring attention variants: pass-KV and pass-Q

Abstract

We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, to cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
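The "lossless exact" property of ring attention can be illustrated with a small single-process sketch. It is not the authors' implementation: the ranks are simulated in a loop with no real communication, attention is non-causal for brevity, and the helper names (`partial_attention`, `merge`, `ring_pass_kv`) are hypothetical. Assuming a pass-KV style schedule, each rank keeps its own query shard and sees every key/value shard in turn; partial results are combined with the standard log-sum-exp rescaling, which reproduces full softmax attention up to floating-point error.

```python
import math
import random


def partial_attention(q, keys, values):
    """Attention of one query over one KV shard.

    Returns (normalized output, log-sum-exp of the scaled scores).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) / denom
           for j in range(dim)]
    return out, m + math.log(denom)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if computed over both shards at once."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    total = wa + wb
    out = [(wa * a + wb * b) / total for a, b in zip(out_a, out_b)]
    return out, m + math.log(total)


def ring_pass_kv(q_shards, k_shards, v_shards):
    """Simulate a pass-KV style ring: queries stay put, KV shards rotate."""
    n = len(q_shards)
    results = []
    for rank in range(n):
        rank_out = []
        for q in q_shards[rank]:
            acc = None
            for step in range(n):
                src = (rank - step) % n  # KV shard visible at this ring step
                part = partial_attention(q, k_shards[src], v_shards[src])
                acc = part if acc is None else merge(*acc, *part)
            rank_out.append(acc[0])
        results.append(rank_out)
    return results


# Compare the sharded result against attention over the full sequence.
random.seed(0)
n_ranks, tokens_per_rank, dim = 4, 3, 8


def rand_shards():
    return [[[random.gauss(0, 1) for _ in range(dim)]
             for _ in range(tokens_per_rank)] for _ in range(n_ranks)]


q_shards, k_shards, v_shards = rand_shards(), rand_shards(), rand_shards()
all_k = [k for shard in k_shards for k in shard]
all_v = [v for shard in v_shards for v in shard]

ring_out = ring_pass_kv(q_shards, k_shards, v_shards)
max_err = max(
    abs(a - b)
    for rank in range(n_ranks)
    for q, out in zip(q_shards[rank], ring_out[rank])
    for a, b in zip(out, partial_attention(q, all_k, all_v)[0])
)
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

Because `merge` only needs each partial output and its log-sum-exp, the same combination step would apply if queries were rotated instead of key/value shards; which tensor is cheaper to move depends on their relative sizes in a given workload.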

Taxonomy

Topics: Advanced Data Compression Techniques · Algorithms and Data Compression · Machine Learning in Healthcare

Methods: Softmax · Attention Is All You Need