Using Pilot Systems to Execute Many Task Workloads on Supercomputers
Andre Merzky, Matteo Turilli, Manuel Maldonado, Mark Santcroos, Shantenu Jha

TL;DR
This paper presents RADICAL-Pilot (RP), a modular, Python-based pilot system that executes many-task workloads on supercomputers by decoupling workload specification, resource selection, and task execution through job placeholders and late binding.
Contribution
It describes RADICAL-Pilot's design, architecture, and implementation, and characterizes its performance, including the steady-state execution of up to 16K concurrent tasks.
Findings
Spawns more than 100 tasks/second
Sustains steady-state execution of up to 16K concurrent tasks
Runs stand-alone or integrated with other application-level tools as a runtime system
Abstract
High performance computing systems have historically been designed to support applications comprised of mostly monolithic, single-job workloads. Pilot systems decouple workload specification, resource selection, and task execution via job placeholders and late-binding. Pilot systems help to satisfy the resource requirements of workloads comprised of multiple tasks. RADICAL-Pilot (RP) is a modular and extensible Python-based pilot system. In this paper we describe RP's design, architecture and implementation, and characterize its performance. RP is capable of spawning more than 100 tasks/second and supports the steady-state execution of up to 16K concurrent tasks. RP can be used stand-alone, as well as integrated with other application-level tools as a runtime system.
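The late-binding idea in the abstract can be illustrated with a small toy model. The sketch below is not RADICAL-Pilot's API or implementation: `Task`, `Pilot`, and `run_workload` are hypothetical names, and the pilot is assumed to be a fixed allocation of cores that binds queued tasks in FIFO order only when enough cores are free.

```python
from collections import deque


class Task:
    """Hypothetical task: a name and the number of cores it needs."""

    def __init__(self, name, cores):
        self.name = name
        self.cores = cores


class Pilot:
    """Hypothetical placeholder job holding a fixed core allocation."""

    def __init__(self, cores):
        self.cores = cores
        self.free = cores
        self.running = []

    def try_start(self, task):
        # Bind the task to the pilot only if enough cores are free right now.
        if task.cores > self.free:
            return False
        self.free -= task.cores
        self.running.append(task)
        return True

    def finish(self, task):
        # Release the task's cores back to the pilot.
        self.running.remove(task)
        self.free += task.cores


def run_workload(pilot, tasks):
    """Late binding: tasks wait in a queue until the pilot has room for them.

    Returns the order in which tasks were started and the peak number of
    tasks that ran concurrently.
    """
    queue = deque(tasks)
    started, peak = [], 0
    while queue or pilot.running:
        # Start as many queued tasks as currently fit, in FIFO order.
        while queue and pilot.try_start(queue[0]):
            started.append(queue.popleft().name)
        if not pilot.running:
            # The pilot is idle and the next task still does not fit.
            raise ValueError(f"task {queue[0].name} exceeds the pilot size")
        peak = max(peak, len(pilot.running))
        # Simulate completion of the oldest running task.
        pilot.finish(pilot.running[0])
    return started, peak


if __name__ == "__main__":
    workload = [Task("t0", 2), Task("t1", 2), Task("t2", 1), Task("t3", 4)]
    print(run_workload(Pilot(cores=4), workload))
```

The workload is described independently of the allocation; which task runs on which cores is decided only when cores become free inside the pilot, so no task has to go through the batch queue separately.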
Taxonomy
Topics: Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management
