Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact
Ayesha Afzal, Georg Hager, Gerhard Wellein

TL;DR
This paper develops an analytic model to understand how idle waves propagate and decay in parallel HPC programs, considering communication topology, collective interactions, and noise effects, validated through microbenchmarks and real supercomputers.
Contribution
It introduces an analytic model for idle wave propagation and decay in MPI programs, with emphasis on the effects of communication topology and noise, backed by extensive experimental validation.
Findings
Idle wave velocity depends on communication parameters and topology.
Depending on their implementation, MPI collectives can be transparent to idle waves.
Noise power influences the decay rate of idle waves.
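The dependence of wave velocity on communication distance can be illustrated with a toy simulation. This is a minimal sketch, not the paper's model: it assumes an open chain of ranks, a fixed compute time per iteration, zero communication cost, no noise, and a bidirectional exchange with partners at rank distance `dist`. The function name `simulate_idle_wave` and all of its parameters are hypothetical and chosen for illustration only.

```python
def simulate_idle_wave(n_ranks=16, n_iters=10, t_comp=1.0,
                       delay=5.0, src=4, dist=1):
    """Toy bulk-synchronous model (illustrative assumption, not the paper's model).

    Each rank computes for t_comp, then waits until it and its partners at
    rank distance `dist` have finished the same iteration. A one-off delay
    is injected on rank `src` in iteration 0.

    Returns, per rank, the first iteration in which the rank had to wait
    (None if it never waited within n_iters).
    """
    start = [0.0] * n_ranks          # start time of the current iteration
    arrival = [None] * n_ranks       # first iteration with idle time

    for k in range(n_iters):
        # End of the compute phase; only the source rank is delayed, once.
        finish = [
            start[i] + t_comp + (delay if (k == 0 and i == src) else 0.0)
            for i in range(n_ranks)
        ]

        # A rank may proceed once it and its partners are done computing.
        next_start = []
        for i in range(n_ranks):
            partners = [j for j in (i - dist, i, i + dist) if 0 <= j < n_ranks]
            next_start.append(max(finish[j] for j in partners))

        # Record where the idle wave shows up for the first time.
        for i in range(n_ranks):
            if arrival[i] is None and next_start[i] - finish[i] > 1e-12:
                arrival[i] = k

        start = next_start

    return arrival


if __name__ == "__main__":
    for d in (1, 2):
        print(f"dist={d}:", simulate_idle_wave(dist=d))
```

In this simplified setting the wave front advances by `dist` ranks per iteration in both directions, so doubling the communication distance doubles the propagation speed. Ranks that are never reached through the chain of partners (for example odd ranks when `dist=2` and `src` is even) stay unaffected. Communication overhead, protocol effects, and noise, which the paper treats, are deliberately left out here.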
Abstract
Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlying parallel code. This paper makes significant contributions to the understanding of idle wave dynamics. We study the propagation mechanisms of idle waves across the ranks of MPI-parallel programs. We present a validated analytic model for their propagation velocity with respect to communication parameters and topology, with a special emphasis on sparse communication patterns. We study the interaction of idle waves with MPI collectives and show that, depending on the implementation, a collective…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
