Understanding and Improving Communication Performance in Multi-node LLM Inference
Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, Akarsh Srivastava, Harshitha Menon, Charles Fredrick Jekel, Abhinav Bhatele

TL;DR
This paper analyzes the performance bottlenecks of multi-node LLM inference on GPU supercomputers, introduces NVRAR, a hierarchical all-reduce algorithm, and reports latency reductions of up to 3.6× for all-reduce and up to 1.72× for multi-node Llama 3.1 inference.
Contribution
The paper presents a detailed performance study of multi-node LLM inference, develops the NVRAR hierarchical all-reduce algorithm, and shows its effectiveness in reducing latency.
Findings
NVRAR reduces all-reduce latency by up to 3.6× compared to NCCL.
Integrating NVRAR into YALIS yields up to a 1.72× latency reduction in multi-node Llama 3.1 inference.
All-reduce is identified as a key bottleneck in multi-node LLM inference.
Abstract
As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9-3.6× lower latency than NCCL for message sizes between 128 KB and…
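
To make the recursive-doubling idea concrete, here is a minimal single-process sketch in Python. It is not NVRAR and does not use NVSHMEM or NCCL; it only simulates the communication pattern on a list of per-rank values. The function names, the power-of-two rank count, and the three-phase hierarchy (intra-node reduce, inter-node recursive doubling, intra-node broadcast) are assumptions made for illustration, not details taken from the paper.

```python
def recursive_doubling_allreduce(values):
    """Simulate a sum all-reduce over len(values) ranks (must be a power of two).

    Each list entry stands in for the buffer held by one rank.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("rank count must be a non-zero power of two")
    current = list(values)
    distance = 1
    while distance < p:
        # In each round, rank r exchanges partial sums with rank (r XOR distance)
        # and both keep the combined result. After log2(p) rounds every rank
        # holds the full sum.
        current = [current[r] + current[r ^ distance] for r in range(p)]
        distance *= 2
    return current


def hierarchical_allreduce(values, gpus_per_node):
    """Hypothetical three-phase hierarchy (an assumption, not the paper's design):

    1. reduce within each node,
    2. recursive doubling across nodes,
    3. broadcast the result back within each node.
    """
    if gpus_per_node <= 0 or len(values) % gpus_per_node:
        raise ValueError("rank count must be a multiple of gpus_per_node")
    # Phase 1: one partial sum per node.
    node_sums = [
        sum(values[i:i + gpus_per_node])
        for i in range(0, len(values), gpus_per_node)
    ]
    # Phase 2: only one value per node crosses the (slower) inter-node network.
    global_sums = recursive_doubling_allreduce(node_sums)
    # Phase 3: every GPU receives its node's copy of the global sum.
    return [global_sums[r // gpus_per_node] for r in range(len(values))]


if __name__ == "__main__":
    print(recursive_doubling_allreduce([1, 2, 3, 4]))
    print(hierarchical_allreduce(list(range(1, 9)), gpus_per_node=4))
```

In this sketch the inter-node phase takes log2(number of nodes) exchange rounds, which is why recursive doubling is attractive for small, latency-bound messages: the round count grows logarithmically rather than linearly with the number of participants.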
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications
