# An Empirical Evaluation of Allgatherv on Multi-GPU Systems

**Authors:** Thomas B. Rolinger, Tyler A. Simon, Christopher D. Krieger

arXiv: 1812.05964 · 2018-12-17

## TL;DR

This paper evaluates the performance of the Allgatherv collective communication routine on multi-GPU systems, analyzing how topology and communication libraries impact efficiency, especially with irregular message sizes in real-world applications.

## Contribution

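It provides an empirical performance analysis of Allgatherv on three multi-GPU systems, comparing traditional MPI, CUDA-aware MVAPICH, and NCCL, and highlights discrepancies between OSU micro-benchmark results and the behavior of a real application (sparse tensor factorization).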

## Key findings

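- With the highly irregular message sizes of sparse tensor factorization, Allgatherv shows performance trends that contradict those in the OSU micro-benchmark.
- Communication performance depends on both the communication library (traditional MPI, CUDA-aware MVAPICH, or NCCL) and the GPU network topology of the system.
- Some trends observed with real-world tensor data sets are absent from the micro-benchmark altogether.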

## Abstract

Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling out an application to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we provide a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on GPU network topology and the communication library used. We present results from the OSU micro-benchmark and conduct a case study for sparse tensor factorization, one application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying numbers of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH, and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs, and Cray's CS-Storm with 16 GPUs. Our results show that irregularity in the tensor data sets produces trends that contradict those in the OSU micro-benchmark, as well as trends that are absent from the benchmark.
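To make "Allgatherv with irregular message sizes" concrete, the following is a minimal single-process sketch of the collective's semantics: each rank contributes a buffer of a different length, and every rank ends up with the concatenation of all buffers, placed at per-rank displacements. It is an illustration only, not the paper's tool and not an MPI or NCCL call; the function name `simulate_allgatherv` and the sample buffers are hypothetical.

```python
from itertools import accumulate


def simulate_allgatherv(send_buffers):
    """Simulate Allgatherv semantics in one process (no MPI involved).

    send_buffers[r] is the data contributed by rank r; lengths may differ.
    Returns (counts, displs, recv_buffers), where recv_buffers[r] is what
    rank r would hold after the collective completes.
    """
    # Per-rank element counts: these vary from rank to rank in Allgatherv.
    counts = [len(buf) for buf in send_buffers]
    # Displacement of rank r's data in the receive buffer:
    # an exclusive prefix sum over the counts.
    displs = [0] + list(accumulate(counts))[:-1]
    total = sum(counts)

    recv_buffers = []
    for _ in send_buffers:
        recv = [None] * total
        for rank, buf in enumerate(send_buffers):
            start = displs[rank]
            recv[start:start + counts[rank]] = buf
        recv_buffers.append(recv)
    return counts, displs, recv_buffers


# Hypothetical irregular contributions from four ranks (one sends nothing).
send = [[1, 2, 3, 4, 5], [6], [], [7, 8]]
counts, displs, recv = simulate_allgatherv(send)
print(counts)   # [5, 1, 0, 2]
print(displs)   # [0, 5, 6, 6]
print(recv[0])  # [1, 2, 3, 4, 5, 6, 7, 8]
```

In this sketch rank 0 contributes five elements while rank 2 contributes none. The paper's case study concerns this kind of imbalance, which a benchmark that uses one message size for all ranks would not exercise.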

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1812.05964/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/1812.05964/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/1812.05964/full.md

---
Source: https://tomesphere.com/paper/1812.05964