Efficient and Eventually Consistent Collective Operations

Roman Iakymchuk; Amandio Faustino; Andrew Emerson; Joao Barreto; Valeria Bartsch; Rodrigo Rodrigues; Jose C. Monteiro

arXiv:2203.17063·cs.DC·April 1, 2022

TL;DR

This paper introduces an efficient, eventually consistent approach to collective operations in parallel computing, reducing communication overhead and improving performance for ML/DL and HPC applications, especially in strong scaling scenarios.

Contribution

It proposes a novel design for eventually consistent collectives, optimizing Broadcast and Reduce, and integrates classic collectives into GASPI, demonstrating promising preliminary performance gains.

Findings

01 Significant improvements in Allreduce and AlltoAll performance

02 Reduced communication in Broadcast and Reduce operations

03 Enhanced GASPI ecosystem with new collective implementations

Abstract

Collective operations are common features of parallel programming models that are frequently used in High-Performance Computing (HPC) and machine/deep learning (ML/DL) applications. In strong scaling scenarios, collective operations can negatively impact the overall application performance: as the core count increases, the load per rank decreases, while the time spent in collective operations increases logarithmically. In this article, we propose a design for eventually consistent collectives suitable for ML/DL computations by reducing communication in Broadcast and Reduce, as well as by exploring the Stale Synchronous Parallel (SSP) synchronization model for the Allreduce collective. Moreover, we also enrich the GASPI ecosystem with frequently used classic/consistent collective operations, such as Allreduce for large messages and AlltoAll used in an HPC code. Our implementations show…
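The strong scaling argument in the abstract can be illustrated with a toy cost model. The sketch below is not taken from the paper: it assumes a fixed total workload split evenly across ranks and a tree-based collective whose cost grows with ceil(log2(ranks)). The function names, the workload size, and the per-stage latency are hypothetical values chosen only for illustration.

```python
import math


def step_time(total_work, ranks, latency):
    """Toy model (an assumption, not the paper's): return (compute, collective) time.

    compute    = total_work / ranks             (load per rank shrinks)
    collective = latency * ceil(log2(ranks))    (logarithmic growth)
    """
    compute = total_work / ranks
    collective = latency * math.ceil(math.log2(ranks)) if ranks > 1 else 0.0
    return compute, collective


def collective_share(total_work, ranks, latency):
    """Fraction of one step spent inside the collective under the toy model."""
    compute, collective = step_time(total_work, ranks, latency)
    return collective / (compute + collective)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 work units, 0.5 time units per tree stage.
    for ranks in (1, 16, 256, 4096):
        share = collective_share(1000.0, ranks, 0.5)
        print(f"ranks={ranks:5d}  collective share={share:.3f}")
```

Under these assumed parameters the collective's share of each step rises steadily with the rank count, which is the effect that motivates reducing communication or relaxing synchronization in collectives.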
