Understanding and Improving Communication Performance in Multi-node LLM Inference

Prajwal Singhania; Siddharth Singh; Lannie Dalton Hough; Akarsh Srivastava; Harshitha Menon; Charles Fredrick Jekel; Abhinav Bhatele

arXiv:2511.09557·cs.DC·May 21, 2026

TL;DR

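This paper analyzes performance bottlenecks in multi-node LLM inference on GPU-based supercomputers and introduces NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. It reports up to 3.6× lower all-reduce latency than NCCL and up to a 1.72× latency reduction in multi-node Llama 3.1 inference when NVRAR is integrated into YALIS.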

Contribution

The paper presents a detailed performance study of multi-node LLM inference across several inference engines and YALIS, identifies all-reduce as a common bottleneck, and develops NVRAR, a hierarchical all-reduce algorithm that lowers all-reduce latency relative to NCCL.

Findings

01

NVRAR reduces all-reduce latency by up to 3.6× compared to NCCL.

02

Integrating NVRAR into YALIS yields up to a 1.72× latency reduction in multi-node Llama 3.1 inference.

03

All-reduce is identified as a key bottleneck in multi-node LLM inference.

Abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Because all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9×-3.6× lower latency than NCCL for message sizes between 128 KB and…
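
To make the phrase "hierarchical all-reduce based on recursive doubling" concrete, here is a minimal single-process sketch in Python. It is not the NVRAR implementation: it uses no GPUs and no NVSHMEM, and the three-stage layout (reduce within a node, recursive doubling across nodes, copy back within a node) is an assumption chosen for illustration. The function names `recursive_doubling_allreduce` and `hierarchical_allreduce` are hypothetical. It shows only the textbook exchange pattern, in which each rank pairs with the rank whose index differs in one bit, so P ranks finish in log2(P) steps.

```python
from typing import List


def recursive_doubling_allreduce(values: List[float]) -> List[float]:
    """Simulate a sum all-reduce over len(values) ranks (power of two only).

    Illustrative simulation, not NVRAR: all "ranks" live in one Python list.
    """
    p = len(values)
    if p == 0 or p & (p - 1):
        raise ValueError("rank count must be a non-zero power of two")

    buf = list(values)
    distance = 1
    while distance < p:
        # In each step, rank r exchanges its partial sum with rank r XOR distance
        # and both keep the combined value.
        buf = [buf[r] + buf[r ^ distance] for r in range(p)]
        distance *= 2
    return buf


def hierarchical_allreduce(node_values: List[List[float]]) -> List[List[float]]:
    """Simulate a two-level all-reduce; node_values[n][g] is GPU g on node n.

    The stage split below is an assumption made for this sketch.
    """
    # Stage 1: reduce within each node.
    node_sums = [sum(gpus) for gpus in node_values]
    # Stage 2: recursive doubling across nodes (one participant per node).
    totals = recursive_doubling_allreduce(node_sums)
    # Stage 3: every GPU on a node receives that node's final total.
    return [[totals[n]] * len(gpus) for n, gpus in enumerate(node_values)]
```

For example, `hierarchical_allreduce([[1, 2], [3, 4]])` gives every simulated GPU the global sum. Restricting the cross-node stage to one value per node is what keeps the number of inter-node exchange steps at log2 of the node count rather than of the total GPU count.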

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications