Design and Performance Characterization of RADICAL-Pilot on Leadership-class Platforms
Andre Merzky, Matteo Turilli, Mikhail Titov, Aymen Al-Saadi, Shantenu Jha

TL;DR
RADICAL-Pilot is a scalable, flexible runtime system designed to efficiently execute large-scale, heterogeneous scientific workloads on leadership-class supercomputers, supporting diverse task types and high concurrency.
Contribution
This paper introduces RADICAL-Pilot, a novel, portable, and extensible pilot system that demonstrates scalable performance on major HPC platforms for large, heterogeneous workloads.
Findings
Supports tens of thousands of heterogeneous tasks
Achieves scalable performance on DOE and NSF supercomputers
Effective for CPU, GPU, MPI, and Python tasks
Abstract
Many extreme-scale scientific applications have workloads comprised of a large number of individual high-performance tasks. The Pilot abstraction decouples workload specification, resource management, and task execution via job placeholders and late-binding. As such, suitable implementations of the Pilot abstraction can support the collective execution of a large number of tasks on supercomputers. We introduce RADICAL-Pilot (RP) as a portable, modular and extensible pilot-enabled runtime system. We describe RP's design, architecture and implementation. We characterize its performance and show its ability to scalably execute workloads comprised of tens of thousands of heterogeneous tasks on DOE and NSF leadership-class HPC platforms. Specifically, we investigate RP's weak/strong scaling with CPU/GPU, single/multi core, (non)MPI tasks and Python functions when using most of ORNL Summit and…
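The late-binding idea in the abstract can be illustrated with a toy simulation. The sketch below is not RADICAL-Pilot code and does not use its API: the names `Task`, `Pilot`, and `run_workload`, the discrete time steps, and the first-in-first-out binding policy are all assumptions made for illustration. It only shows the general pattern in which a placeholder job holds a fixed set of cores and tasks are matched to those cores when they become free, rather than when the workload is specified.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cores: int      # cores the task needs (hypothetical field)
    duration: int   # run time in abstract time steps (hypothetical field)


@dataclass
class Pilot:
    cores: int                                   # cores held by the placeholder job
    running: list = field(default_factory=list)  # [task, remaining_steps] pairs

    def free_cores(self) -> int:
        return self.cores - sum(task.cores for task, _ in self.running)


def run_workload(pilot, tasks):
    """Bind queued tasks to free pilot cores; return {task name: finish step}."""
    for task in tasks:
        if task.cores > pilot.cores:
            raise ValueError(f"task {task.name} does not fit in the pilot")

    queue = deque(tasks)
    finished = {}
    step = 0
    while queue or pilot.running:
        # Late binding: a task is matched to cores only once they are free.
        # Simplification: strict FIFO, so a large task at the head of the
        # queue blocks smaller tasks behind it.
        while queue and queue[0].cores <= pilot.free_cores():
            task = queue.popleft()
            pilot.running.append([task, task.duration])

        # Advance simulated time by one step.
        step += 1
        for entry in pilot.running:
            entry[1] -= 1

        # Record completed tasks and release their cores.
        for task, remaining in pilot.running:
            if remaining <= 0:
                finished[task.name] = step
        pilot.running = [e for e in pilot.running if e[1] > 0]
    return finished
```

As a usage example, a 4-core pilot given tasks a (2 cores, 2 steps), b (2 cores, 1 step), c (4 cores, 1 step), and d (1 core, 1 step), in that order, finishes them at steps 2, 1, 3, and 4 respectively. Task c waits until both a and b have released their cores, which is the point of binding at execution time rather than at submission time.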
