Simulation-based Optimization and Sensibility Analysis of MPI Applications: Variability Matters
Tom Cornebize (POLARIS, UGA), Arnaud Legrand (CNRS, POLARIS)

TL;DR
This paper introduces a simulation-based methodology for predicting MPI application performance, enabling efficient tuning and sensitivity analysis by modeling key parameters and platform variability, demonstrated on the HPL benchmark.
Contribution
It presents a novel surrogate modeling approach that decouples platform complexity from application behavior, allowing accurate performance predictions and optimization insights for MPI applications.
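The decoupling idea can be illustrated with a minimal sketch. It assumes a purely linear duration model for a dgemm-like kernel and a single process with no communication or pivoting. The compute kernel is never executed; a virtual clock is simply advanced by the model's predicted duration. All names below (`LinearKernelModel`, `VirtualClock`, `emulated_dgemm`, `emulate_update_phase`) are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class LinearKernelModel:
    """Hypothetical kernel model: duration = alpha * m*n*k + beta (seconds)."""
    alpha: float
    beta: float

    def predict(self, m: int, n: int, k: int) -> float:
        return self.alpha * m * n * k + self.beta


class VirtualClock:
    """Simulated time of one emulated process."""

    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds


def emulated_dgemm(m: int, n: int, k: int, clock: VirtualClock, model: LinearKernelModel) -> None:
    # The real matrix product is skipped. Only its predicted duration is
    # charged to the virtual clock, so emulation costs O(1) instead of O(m*n*k).
    clock.advance(model.predict(m, n, k))


def emulate_update_phase(n_total: int, block: int, clock: VirtualClock, model: LinearKernelModel) -> None:
    # Loosely HPL-like right-looking factorization (hypothetical, simplified):
    # at each step the remaining (n_total - i - block)^2 trailing submatrix
    # is updated by a rank-`block` product, i.e. a dgemm with k = block.
    for i in range(0, n_total, block):
        remaining = n_total - i - block
        if remaining > 0:
            emulated_dgemm(remaining, remaining, block, clock, model)
```

For example, with `alpha=1.0`, `beta=0.5`, a 6×6 matrix and block size 2, the two emulated updates cost 32.5 and 8.5 virtual seconds, for a total of 41.0. No floating-point work on matrices is performed.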
Findings
Performance predictions accurate to within a few percent
Effective identification of modeling pitfalls like variability and heterogeneity
Enabling sensitivity analysis under platform uncertainty
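The following toy model is not the paper's simulator. It is a hypothetical bulk-synchronous iteration in which every node computes the same amount of work and then all nodes synchronize. It illustrates why variability and heterogeneity cannot be averaged away. Because an iteration ends only when the slowest node finishes, per-node noise and a single slow node both lengthen the predicted runtime beyond what mean node speeds suggest. The names `bulk_synchronous_time` and `predicted_runtime` and the Gaussian noise assumption are ours.

```python
import random


def bulk_synchronous_time(node_speeds, work, rel_noise, rng):
    """One hypothetical iteration: each node computes `work` units, then all synchronize.

    node_speeds are in work units per second. Each node's duration is drawn
    from a normal distribution (truncated at 0) with relative std `rel_noise`.
    The iteration lasts as long as the slowest node.
    """
    durations = []
    for speed in node_speeds:
        mean = work / speed
        durations.append(max(0.0, rng.gauss(mean, rel_noise * mean)))
    return max(durations)


def predicted_runtime(node_speeds, work, iterations, rel_noise, seed=0):
    """Sum of `iterations` synchronized steps, with a seeded RNG for reproducibility."""
    rng = random.Random(seed)
    return sum(
        bulk_synchronous_time(node_speeds, work, rel_noise, rng)
        for _ in range(iterations)
    )
```

With 16 identical, noise-free nodes and 100 unit-work iterations the prediction is exactly 100 seconds. Adding 10% per-node noise, or making one node half as fast, raises it noticeably. A sensitivity analysis would sweep such platform parameters and check whether the ranking of candidate application configurations stays stable.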
Abstract
Finely tuning MPI applications and understanding the influence of key parameters (number of processes, granularity, collective operation algorithms, virtual topology, and process placement) is critical to obtain good performance on supercomputers. With the high consumption of running applications at scale, doing so solely to optimize their performance is particularly costly. Having inexpensive but faithful predictions of expected performance could be a great help for researchers and system administrators. The methodology we propose decouples the complexity of the platform, which is captured through statistical models of the performance of its main components (MPI communications, BLAS operations), from the complexity of adaptive applications by emulating the application and skipping regular non-MPI parts of the code. We demonstrate the capability of our method with High-Performance Linpack (HPL), the…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
