Scaling on Frontier: Uncertainty Quantification Workflow Applications using ExaWorks to Enable Full System Utilization
Mikhail Titov, Robert Carson, Matthew Rolchigo, John Coleman, James Belak, Matthew Bement, Daniel Laney, Matteo Turilli, Shantenu Jha

TL;DR
This paper demonstrates how the EnTK middleware within the ExaWorks framework enables efficient, scalable execution of complex uncertainty quantification workflows on the Frontier supercomputer, achieving high resource utilization and adaptability.
Contribution
It applies EnTK to novel exascale workflows, showing that it can handle heterogeneous resource requirements, fault tolerance, and portability on Frontier.
Findings
Achieved up to 90% resource utilization on Frontier.
Successfully executed workflows on up to 8000 nodes.
Demonstrated flexible adaptation to different execution environments.
Abstract
When running at scale, modern scientific workflows require middleware to handle allocated resources, distribute computing payloads and guarantee a resilient execution. While individual steps might not require sophisticated control methods, bringing them together as a whole workflow requires advanced management mechanisms. In this work, we used RADICAL-EnTK (Ensemble Toolkit) - one of the SDK components of the ECP ExaWorks project - to implement and execute the novel Exascale Additive Manufacturing (ExaAM) workflows on up to 8000 compute nodes of the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. EnTK allowed us to address challenges such as varying resource requirements (e.g., heterogeneity, size, and runtime), different execution environment per workflow, and fault tolerance. And a native portability feature of the developed EnTK applications allowed us to…
