Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Gabin Schieffer; Ruimin Shi; Stefano Markidis; Andreas Herten; Jennifer Faj; Ivy Peng

arXiv:2410.00801·cs.DC·October 2, 2024

Open Access

TL;DR

This paper investigates data movement performance in AMD multi-GPU systems using Infinity Fabric, proposing a methodology to evaluate communication strategies and demonstrating that direct peer-to-peer access and RCCL outperform MPI in latency and bandwidth.

Contribution

It introduces a comprehensive test and evaluation methodology for data movement in AMD multi-GPU systems, highlighting the performance benefits of direct peer-to-peer communication and RCCL over MPI.

Findings

01

Peer-to-peer memory access reduces latency.

02

RCCL outperforms MPI in bandwidth.

03

Evaluation methodology aids in optimizing multi-GPU communication.

Abstract

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test…
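
One common way to summarize latency and bandwidth measurements of the kind the abstract describes is a linear cost model, t = latency + size / bandwidth, fitted to timed transfers of different sizes. The sketch below illustrates that idea only; the model, the function name `fit_latency_bandwidth`, and the synthetic numbers are assumptions made for illustration and are not taken from the paper or its results.

```python
def fit_latency_bandwidth(sizes_bytes, times_s):
    """Least-squares fit of t = latency + size / bandwidth.

    Hypothetical helper, not from the paper. Returns a tuple
    (latency in seconds, bandwidth in bytes per second).
    """
    n = len(sizes_bytes)
    if n < 2 or n != len(times_s):
        raise ValueError("need at least two (size, time) pairs of equal length")

    mean_x = sum(sizes_bytes) / n
    mean_y = sum(times_s) / n

    # Ordinary least squares for a line y = intercept + slope * x.
    sxx = sum((x - mean_x) ** 2 for x in sizes_bytes)
    if sxx == 0:
        raise ValueError("message sizes must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(sizes_bytes, times_s))

    slope = sxy / sxx                    # seconds per byte
    if slope <= 0:
        raise ValueError("fitted slope is not positive; data does not fit the model")

    latency = mean_y - slope * mean_x    # intercept: time for a zero-byte transfer
    bandwidth = 1.0 / slope              # bytes per second
    return latency, bandwidth


if __name__ == "__main__":
    # Synthetic timings generated from made-up parameters (not measured data).
    true_latency = 2e-6      # seconds, assumed for illustration
    true_bandwidth = 50e9    # bytes per second, assumed for illustration
    sizes = [1 << 10, 1 << 16, 1 << 20, 1 << 24]
    times = [true_latency + s / true_bandwidth for s in sizes]

    latency, bandwidth = fit_latency_bandwidth(sizes, times)
    print(f"latency   = {latency * 1e6:.3f} us")
    print(f"bandwidth = {bandwidth / 1e9:.3f} GB/s")
```

With real measurements, the same fit could be applied separately to each communication option (for example peer-to-peer copies, RCCL, and MPI) so that the intercepts and slopes can be compared side by side. Small-message timings mostly determine the latency term, and large-message timings mostly determine the bandwidth term.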

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Interconnection Networks and Systems