High-Throughput Computing on High-Performance Platforms: A Case Study
Danila Oleynik, Sergey Panitkin, Matteo Turilli, Alessio Angius, Kaushik De, Alexei Klimentov, Sarp H. Oral, Jack C. Wells, Shantenu Jha

TL;DR
This paper presents a case study of integrating high-throughput computing with supercomputers to meet future demands, focusing on the ATLAS experiment's use of Titan and lessons for scalable scientific computing.
Contribution
It evaluates design and operational considerations for using Titan at scale, characterizes a new executor for PanDA, and offers lessons for integrating experimental systems with supercomputers.
Findings
Reached sustained production of approximately 52 million core-hours per year on Titan.
Developed a next-generation executor supporting new workloads.
Provided early insights into integrating experimental systems with supercomputers.
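To put the core-hour figure in perspective, a yearly budget can be converted into the average number of cores kept busy around the clock. The following is a minimal sketch of that arithmetic only; the function name and the 365-day year are illustrative assumptions, not something taken from the paper.

```python
# Illustrative conversion of a yearly core-hour budget into an average
# number of continuously busy cores. Assumes a 365-day year.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def average_concurrent_cores(core_hours_per_year: float) -> float:
    """Return the average number of cores that must run nonstop
    to consume the given number of core-hours in one year."""
    return core_hours_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical helper applied to the ~52M core-hours quoted above.
    cores = average_concurrent_cores(52e6)
    print(f"{cores:.0f} cores busy on average")
```

Under these assumptions, about 52M core-hours per year corresponds to roughly 5,900 cores in continuous use, averaged over the year.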
Abstract
The computing systems used by LHC experiments have historically consisted of the federation of hundreds to thousands of distributed resources, ranging from small to mid-size resources. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan, a DOE leadership facility, in conjunction with traditional distributed high-throughput computing to reach sustained production scales of approximately 52M core-hours a year. The three main contributions of this paper are: (i) a critical evaluation of design and operational considerations to support the sustained, scalable and production usage of Titan; (ii) a preliminary characterization of a next generation executor for PanDA to support new workloads…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques
