Context Parallelism for Scalable Million-Token Inference
Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang

TL;DR
This paper introduces context parallelism for long-context large language model inference, achieving near-linear scaling of prefill latency on multi-GPU setups, with two ring attention variants (pass-KV and pass-Q) that cover full prefill, persistent KV prefill, and decode.
Contribution
The paper proposes a novel context parallelism method and two lossless ring attention variants that significantly improve long-context inference scalability and efficiency.
Findings
Achieves 1M context prefill in 77s on Llama3 405B
Demonstrates near-linear scaling with up to 128 GPUs
Develops lossless ring attention variants: pass-KV and pass-Q
Abstract
We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, to cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP both show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
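To illustrate why a ring attention variant can be "lossless exact", here is a minimal single-process sketch of the pass-KV idea. It is not the authors' implementation. Each simulated rank keeps its own query shard while key/value shards move around a ring, and partial results are merged with the standard online-softmax rescaling. The function names (partial_attention, merge, ring_pass_kv, shard, random_matrix) are hypothetical. The sketch also assumes a single attention head, no causal mask, and equally sized shards, and it replaces real inter-GPU communication with plain indexing.

```python
import math
import random

def partial_attention(q_rows, k_rows, v_rows):
    """Attend q_rows to one KV block.

    Returns, per query row: (unnormalised output, max score, sum of exp).
    """
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    results = []
    for q in q_rows:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_rows]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        out = [sum(w * v[d] for w, v in zip(weights, v_rows))
               for d in range(len(v_rows[0]))]
        results.append((out, m, sum(weights)))
    return results

def merge(acc, new):
    """Combine two partial results row by row (online-softmax rescaling)."""
    merged = []
    for (o1, m1, l1), (o2, m2, l2) in zip(acc, new):
        m = max(m1, m2)
        a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
        out = [a1 * x + a2 * y for x, y in zip(o1, o2)]
        merged.append((out, m, a1 * l1 + a2 * l2))
    return merged

def ring_pass_kv(q_shards, k_shards, v_shards):
    """Simulate N ranks: Q shards stay put, KV shards travel around the ring."""
    n = len(q_shards)
    output = []
    for rank in range(n):
        acc = None
        for step in range(n):
            src = (rank - step) % n  # owner of the KV block held at this step
            part = partial_attention(q_shards[rank], k_shards[src], v_shards[src])
            acc = part if acc is None else merge(acc, part)
        # Normalise once, after every KV block has been seen.
        output.extend([x / l for x in o] for o, _, l in acc)
    return output

def full_attention(q_rows, k_rows, v_rows):
    """Reference: ordinary softmax attention over the whole sequence."""
    parts = partial_attention(q_rows, k_rows, v_rows)
    return [[x / l for x in o] for o, _, l in parts]

def shard(rows, n):
    """Split rows into n equal contiguous shards (assumes len(rows) % n == 0)."""
    size = len(rows) // n
    return [rows[i * size:(i + 1) * size] for i in range(n)]

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    q, k, v = (random_matrix(rng, 8, 4) for _ in range(3))
    ring = ring_pass_kv(shard(q, 4), shard(k, 4), shard(v, 4))
    ref = full_attention(q, k, v)
    worst = max(abs(a - b) for r1, r2 in zip(ring, ref) for a, b in zip(r1, r2))
    print("max abs difference vs. full attention:", worst)
```

Because the merge step only rescales partial sums, the sharded result should match full attention up to floating-point rounding, which is the sense in which such a scheme is exact. A pass-Q variant would presumably circulate the query shards instead and leave the KV shards in place; that variant is not sketched here.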
Taxonomy
Topics: Advanced Data Compression Techniques · Algorithms and Data Compression · Machine Learning in Healthcare
Methods: Softmax · Attention Is All You Need
