Improving the Performance and Resilience of MPI Parallel Jobs with Topology and Fault-Aware Process Placement
Ioannis Vardas, Manolis Ploumidis, Manolis Marazakis

TL;DR
This paper presents a topology- and fault-aware process placement method for MPI jobs, integrated into Slurm, that reduces communication costs and improves job completion times by taking system topology and node failure probabilities into account.
Contribution
It introduces a novel process placement approach that accounts for both system topology and node failure probabilities, enhancing MPI job performance and resilience in large-scale HPC systems.
Findings
Achieves up to 31% reduction in MPI job completion time.
Effectively reduces communication costs by optimizing process placement.
Improves resilience by considering node failure probabilities.
Abstract
HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large scale systems face two challenges that hinder further growth: energy efficiency and resiliency. At the same time, applications seeking increased performance rely on advanced parallelism for exploiting system resources, which leads to increased pressure on system interconnects. At large system scales, increased communication locality can be beneficial both in terms of application performance and energy consumption. Towards this direction, several studies focus on deriving a mapping of an application's processes to system nodes in a way that communication cost is reduced. A common approach is to express both the application's communication patterns and the system architecture as graphs and then solve the corresponding mapping problem.…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Parallel Computing and Optimization Techniques
