Characterizing the Performance of Executing Many-tasks on Summit
Matteo Turilli, Andre Merzky, Thomas Naughton, Wael Elwasif, Shantenu Jha

TL;DR
This paper evaluates the performance of executing many independent tasks on the Summit supercomputer using RADICAL-Pilot with JSM and PRRTE, highlighting scalability, overheads, and resource utilization improvements.
Contribution
It provides the first comprehensive performance characterization of RADICAL-Pilot with JSM and PRRTE on Summit for large-scale task workloads.
Findings
PRRTE scales better than JSM for workloads of more than 1,000 tasks
PRRTE overheads are negligible
Resource utilization reaches 63% with 16,000 tasks on 404 nodes
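To make the utilization figure above concrete, here is a minimal sketch of one common way to compute resource utilization: the fraction of available core-time spent running tasks. This is an assumed definition, not necessarily the one used in the paper. The function name `resource_utilization`, the value of 42 usable cores per node, and the total execution time of 1,350 seconds are all illustrative assumptions; only the task count, node count, and 15-minute task duration come from the text.

```python
def resource_utilization(n_tasks, cores_per_task, task_runtime_s, n_cores, total_time_s):
    """Fraction of available core-time spent executing tasks (assumed definition)."""
    if n_cores <= 0 or total_time_s <= 0:
        raise ValueError("n_cores and total_time_s must be positive")
    busy_core_seconds = n_tasks * cores_per_task * task_runtime_s
    available_core_seconds = n_cores * total_time_s
    return busy_core_seconds / available_core_seconds


NODES = 404              # from the findings above
CORES_PER_NODE = 42      # assumption: usable cores per node
TOTAL_TIME_S = 1350.0    # hypothetical total time, not a measured value

utilization = resource_utilization(
    n_tasks=16_000,          # from the findings above
    cores_per_task=1,        # single-core tasks, as in the abstract
    task_runtime_s=15 * 60,  # 15-minute tasks, as in the abstract
    n_cores=NODES * CORES_PER_NODE,
    total_time_s=TOTAL_TIME_S,
)
print(f"Utilization: {utilization:.1%}")
```

Under this definition, any time the pilot holds cores without running a task (scheduling, launching, draining) lowers utilization, which is why launcher overheads matter at this scale.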
Abstract
Many scientific workloads are comprised of many tasks, where each task is an independent simulation or analysis of data. The execution of millions of tasks on heterogeneous HPC platforms requires scalable dynamic resource management and multi-level scheduling. RADICAL-Pilot (RP), an implementation of the Pilot abstraction, addresses these challenges and serves as an effective runtime system to execute workloads comprised of many tasks. In this paper, we characterize the performance of executing many tasks using RP when interfaced with JSM and PRRTE on Summit: RP is responsible for resource management and task scheduling on acquired resources; JSM or PRRTE enact the placement and launching of scheduled tasks. Our experiments provide lower bounds on the performance of RP when integrated with JSM and PRRTE. Specifically, for workloads comprised of homogeneous single-core, 15 minutes-long…
