Asynchronous Execution of Heterogeneous Tasks in ML-driven HPC Workflows
Vincent R. Pascuzzi, Ozgur O. Kilic, Matteo Turilli, Shantenu Jha

TL;DR
This paper investigates how asynchronous execution of heterogeneous tasks in ML-driven HPC workflows can improve resource utilization and reduce workflow completion time, supported by modeling and large-scale experiments on Summit.
Contribution
The paper models the degree of asynchronicity that a workflow permits, proposes metrics for evaluating the benefits of asynchronous execution, and validates the performance improvements through large-scale experiments on Summit.
Findings
Asynchronous execution improves resource utilization.
Performance gains are consistent with the proposed model.
Experiments conducted at scale on Summit validate the approach.
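The utilization finding can be illustrated with a toy calculation. The sketch below is not the paper's model or middleware; it is a minimal illustration under stated assumptions: one CPU slot and one GPU slot, two independent pipelines that each run a simulation on the CPU followed by a training step on the GPU, and made-up durations. The names `Task`, `makespan`, and `utilization` are hypothetical. In the synchronous mode a barrier separates stages, so no training starts until every simulation has finished; in the asynchronous mode a task starts as soon as its own dependencies and its resource are free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    resource: str      # "cpu" or "gpu"; one slot per resource type (assumption)
    duration: float    # illustrative values, not measurements
    stage: int         # position of the task in its pipeline
    deps: tuple = ()   # names of tasks that must finish first


def makespan(tasks, asynchronous):
    """Greedy schedule with one slot per resource type; returns total time."""
    finish = {}         # task name -> finish time
    resource_free = {}  # resource -> time at which its slot becomes free
    stage_end = {}      # stage -> latest finish time seen in that stage
    # Stable sort: earlier stages first, input order preserved within a stage.
    for t in sorted(tasks, key=lambda t: t.stage):
        ready = max((finish[d] for d in t.deps), default=0.0)
        if not asynchronous:
            # Barrier: wait for every task of every earlier stage.
            earlier = [end for s, end in stage_end.items() if s < t.stage]
            ready = max([ready] + earlier)
        start = max(ready, resource_free.get(t.resource, 0.0))
        finish[t.name] = start + t.duration
        resource_free[t.resource] = finish[t.name]
        stage_end[t.stage] = max(stage_end.get(t.stage, 0.0), finish[t.name])
    return max(finish.values())


def utilization(tasks, total_time):
    """Fraction of available slot-time spent running tasks."""
    slots = len({t.resource for t in tasks})
    busy = sum(t.duration for t in tasks)
    return busy / (slots * total_time)


tasks = [
    Task("sim1", "cpu", 4.0, stage=0),
    Task("sim2", "cpu", 4.0, stage=0),
    Task("train1", "gpu", 3.0, stage=1, deps=("sim1",)),
    Task("train2", "gpu", 3.0, stage=1, deps=("sim2",)),
]

sync_time = makespan(tasks, asynchronous=False)
async_time = makespan(tasks, asynchronous=True)
print("synchronous: ", sync_time, utilization(tasks, sync_time))
print("asynchronous:", async_time, utilization(tasks, async_time))
```

With these illustrative durations the barrier version takes 14 time units and the asynchronous version takes 11, because the first training step overlaps with the second simulation. The same amount of work then occupies a larger share of the available CPU and GPU time.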
Abstract
Heterogeneous scientific workflows consist of numerous types of tasks that must execute on heterogeneous resources. Asynchronous execution of those tasks is crucial to improving resource utilization and task throughput and to reducing workflow makespan. Therefore, middleware capable of scheduling and executing different task types across heterogeneous resources must enable asynchronous execution of tasks. In this paper, we investigate the requirements and properties of the asynchronous task execution of machine learning (ML)-driven high performance computing (HPC) workflows. We model the degree of asynchronicity permitted for arbitrary workflows and propose key metrics that can be used to determine qualitative benefits when employing asynchronous execution. Our experiments represent relevant scientific drivers, we perform them at scale on Summit, and we show that the performance…
Taxonomy
Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
