Factorized Machine Learning for Performance Modeling of Massively Parallel Heterogeneous Physical Simulations
Ardavan Oskooi, Christopher Hogan, Alec M. Hammond, M.T. Homer Reid, Steven G. Johnson

TL;DR
This paper presents a neural network-based approach to predict runtime performance of complex, massively parallel heterogeneous physics simulations on cloud clusters, using a factorized modeling strategy to handle large input spaces efficiently.
Contribution
It introduces a novel factorized neural network approach that combines static load balancing and separation of computation and communication to improve runtime prediction accuracy.
Findings
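Neural networks trained on a limited dataset predict the runtime of complex, many-parameter, massively parallel simulations.
Data-driven static load balancing reduces the dependence of the runtime on the precise spatial layout of the heterogeneous physics.
The approach is validated on open-source electrodynamics simulations.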
Abstract
We demonstrate neural-network runtime prediction for complex, many-parameter, massively parallel, heterogeneous-physics simulations running on cloud-based MPI clusters. Because individual simulations are so expensive, it is crucial to train the network on a limited dataset despite the potentially large input space of the physics at each point in the spatial domain. We achieve this using a two-part strategy. First, we perform data-driven static load balancing using regression coefficients extracted from small simulations, which both improves parallel performance and reduces the dependency of the runtime on the precise spatial layout of the heterogeneous physics. Second, we divide the execution time of these load-balanced simulations into computation and communication, factoring crude asymptotic scalings out of each term, and training neural nets for the remaining factor coefficients.…
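The two-part strategy in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-voxel costs in COST_PER_VOXEL, the function names, and the specific scalings (computation proportional to voxels per process, communication proportional to the surface area of a roughly cubic chunk) are all assumptions made here for illustration. In the paper, the cost coefficients come from regression on small simulations and the remaining factor coefficients come from trained neural networks; here both are plain numbers.

```python
# Hypothetical per-voxel costs (arbitrary units). They stand in for the
# regression coefficients that the abstract says are extracted from small
# simulations; the values and material names are made up for illustration.
COST_PER_VOXEL = {"vacuum": 1.0, "dielectric": 1.5, "pml": 4.0}


def split_by_cost(materials, num_procs):
    """Cut a 1D row of voxels into contiguous chunks of similar estimated cost.

    Returns a list of (start, stop) index pairs. Assumes there are many more
    voxels than processes, so no single voxel spans several cost targets.
    """
    costs = [COST_PER_VOXEL[m] for m in materials]
    total = sum(costs)
    bounds, start, running = [], 0, 0.0
    for i, cost in enumerate(costs):
        running += cost
        # Close the current chunk once the running cost reaches the next
        # equal-share target; the last chunk absorbs whatever remains.
        target = total * (len(bounds) + 1) / num_procs
        if len(bounds) < num_procs - 1 and running >= target:
            bounds.append((start, i + 1))
            start = i + 1
    bounds.append((start, len(costs)))
    return bounds


def predicted_time_per_step(num_voxels, num_procs, comp_coeff, comm_coeff):
    """Factorized runtime estimate: an asymptotic scaling times a coefficient.

    Assumed scalings (not taken from the source): computation grows with the
    voxels per process, communication with the surface area of a cubic chunk.
    comp_coeff and comm_coeff are placeholders for neural-network outputs.
    """
    per_proc = num_voxels / num_procs
    computation = comp_coeff * per_proc
    # A single process has no neighbors to exchange boundary data with.
    communication = comm_coeff * per_proc ** (2.0 / 3.0) if num_procs > 1 else 0.0
    return computation + communication


# Usage: expensive absorbing layers at both ends, cheap vacuum in the middle.
row = ["pml"] * 2 + ["vacuum"] * 8 + ["pml"] * 2
chunks = split_by_cost(row, num_procs=2)
estimate = predicted_time_per_step(8000, 8, comp_coeff=1.0, comm_coeff=1.0)
```

Splitting by estimated cost rather than by voxel count is what makes the later runtime model simpler in this sketch: once every process carries a similar load, the estimate depends mainly on aggregate quantities such as the total voxel count and the process count, not on where each material sits.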
Taxonomy
Topics: Scientific Computing and Data Management · Parallel Computing and Optimization Techniques · Data Visualization and Analytics
