Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement

Ioannis Vardas; Manolis Ploumidis; Manolis Marazakis

arXiv:2012.14757 · cs.DC · January 6, 2021

Open Access

TL;DR

This paper presents a topology- and fault-aware process placement method for MPI jobs. By taking system topology and node failure probabilities into account, the method reduces communication costs and improves job completion times. It is integrated into Slurm.

Contribution

It introduces a process placement approach that accounts for both system topology and node failure probabilities, improving MPI job performance and resilience in large-scale HPC systems.

Findings

01. Achieves up to 31% reduction in MPI job completion time.
02. Effectively reduces communication costs by optimizing process placement.
03. Improves resilience by considering node failure probabilities.

Abstract

HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large-scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism for exploiting system resources, which leads to increased pressure on system interconnects. At large system scales, increased communication locality can be beneficial both in terms of application performance and energy consumption. To this end, several studies focus on deriving a mapping of an application's processes to system nodes so that communication cost is reduced. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem.…

Taxonomy

Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques