Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Michael Adams; Amanda Bienz

arXiv:2508.13397·cs.DC·February 26, 2026

Open Access

TL;DR

This paper introduces optimized GPU-aware all-reduce algorithms for heterogeneous multi-GPU nodes. By using multiple CPU cores per GPU, with either GPUDirect RDMA or host-copy communication, it reduces the communication cost of the large all-reduces that are common in deep learning workloads.

Contribution

The paper extends the lane-aware all-reduce algorithm to heterogeneous architectures and uses multiple CPU cores per GPU, rather than the usual single core, to accelerate large GPU-aware all-reduces.

Findings

01. Achieved up to 3x speedup over system MPI on LLNL's Tuolumne supercomputer.

02. Achieved up to 2.45x speedup over system MPI on NCSA's Delta supercomputer.

03. Demonstrated that using multiple CPU cores per GPU effectively accelerates large inter-GPU all-reduces.

Abstract

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures comprise complex nodes, often containing 4 GPUs and dozens to hundreds of CPU cores per node. Parallel applications are typically accelerated on the available GPUs, using only a single CPU core per GPU while the remaining cores sit idle. This paper presents novel optimizations to large GPU-aware all-reduce operations by extending the lane-aware algorithm to heterogeneous architectures and, notably, by using multiple CPU cores per GPU to accelerate these operations. Using GPUDirect RDMA and host copy communications respectively, these multi-CPU-accelerated GPU-aware all-reduces yield speedups over system MPI of up to 3x on LLNL's Tuolumne supercomputer and up to 2.45x for large MPI all-reduces across the NVIDIA A100 GPUs of…
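
To make the data flow behind a lane-style all-reduce concrete, the following is a minimal single-process simulation in Python. It is a sketch under stated assumptions, not the authors' implementation: it assumes the common three-phase lane decomposition (intra-node reduce-scatter, per-lane inter-node all-reduce, intra-node all-gather) and assumes that extra CPU processes per GPU can be modeled as extra lanes, each owning a smaller slice of the buffer. The names `chunk_bounds`, `lane_allreduce`, and `procs_per_gpu` are hypothetical, and no MPI or GPU library is involved.

```python
from typing import List


def chunk_bounds(n: int, parts: int) -> List[range]:
    """Split indices 0..n-1 into `parts` near-equal contiguous ranges."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append(range(start, start + size))
        start += size
    return bounds


def lane_allreduce(buffers: List[List[List[int]]],
                   procs_per_gpu: int = 1) -> List[List[List[int]]]:
    """Simulate a lane-style sum all-reduce.

    buffers[node][gpu] is the input vector held by that GPU.
    Assumption: each GPU is served by `procs_per_gpu` helper processes,
    so a node exposes gpus_per_node * procs_per_gpu lanes.
    """
    num_nodes = len(buffers)
    gpus_per_node = len(buffers[0])
    n = len(buffers[0][0])
    lanes = gpus_per_node * procs_per_gpu
    chunks = chunk_bounds(n, lanes)

    # Phase 1: intra-node reduce-scatter. On every node, lane l ends up
    # with the node-local sum of chunk l.
    partial = [
        [[sum(buffers[node][g][i] for g in range(gpus_per_node))
          for i in chunks[l]]
         for l in range(lanes)]
        for node in range(num_nodes)
    ]

    # Phase 2: inter-node all-reduce, performed independently per lane.
    # Each lane only moves its own slice between nodes.
    reduced = [
        [sum(partial[node][l][k] for node in range(num_nodes))
         for k in range(len(chunks[l]))]
        for l in range(lanes)
    ]

    # Phase 3: intra-node all-gather. Every GPU reassembles the full result.
    full = [x for l in range(lanes) for x in reduced[l]]
    return [[list(full) for _ in range(gpus_per_node)]
            for _ in range(num_nodes)]
```

Calling `lane_allreduce(buffers, procs_per_gpu=2)` on a 2-node, 2-GPU input returns the same element-wise sums as `procs_per_gpu=1`; only the number of lanes, and therefore the slice size each lane carries between nodes, changes. The simulation says nothing about performance, which depends on the network and on the GPUDirect RDMA or host-copy path used.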

Taxonomy

Topics: Distributed and Parallel Computing Systems · Scheduling and Optimization Algorithms · Cloud Computing and Resource Management