A Comprehensive Perspective on Pilot-Job Systems
Matteo Turilli, Mark Santcroos, Shantenu Jha

TL;DR
This paper provides a comprehensive analysis of Pilot-Job systems, exploring their motivations, evolution, core properties, and implementations to clarify their abstraction and address challenges in distributed scientific computing.
Contribution
It outlines the Pilot abstraction, its components, and its properties in detail, and compares seven implementations to improve understanding and interoperability.
Findings
Pilot-Job systems are crucial for large-scale scientific computing.
There is no standard definition or shared architecture for Pilot-Job systems.
Seven exemplar implementations reveal common properties and challenges.
Abstract
Pilot-Job systems play an important role in supporting distributed scientific computing. The Open Science Grid communities use them to consume more than 700 million CPU hours a year, and the ATLAS experiment uses them to process up to 1 million jobs a day on the Worldwide LHC Computing Grid. With the increasing importance of task-level parallelism in high-performance computing, Pilot-Job systems are also being adopted beyond their traditional domains. Despite this growing impact on scientific research, there is no agreed-upon definition of a Pilot-Job system and no clear understanding of the underlying abstraction and paradigm. Pilot-Job implementations have proliferated with no shared best practices or open interfaces, and with little interoperability. Ultimately, this is hindering the realization of the full impact of Pilot-Jobs by limiting their robustness, portability,…
