HTC Scientific Computing in a Distributed Cloud Environment
R. Sobie, A. Agarwal, I. Gable, C. Leavett-Brown, M. Paterson, R. Taylor, A. Charbonneau, R. Impey, W. Podiama

TL;DR
This paper discusses the design, implementation, and operation of a distributed cloud system for high-throughput scientific computing, successfully managing hundreds of thousands of jobs across multiple IaaS clouds.
Contribution
It presents a unified distributed cloud infrastructure for HTC applications, combining existing and custom components, with operational insights and expansion plans.
Findings
Completed approximately 500,000 jobs over two years of production operation
Typical workload of 500 simultaneous embarrassingly parallel jobs, each running for approximately 12 hours
Demonstrates scalable, production-quality distributed cloud computing
Abstract
This paper describes the use of a distributed cloud computing system for high-throughput computing (HTC) scientific applications. The distributed cloud computing system is composed of a number of separate Infrastructure-as-a-Service (IaaS) clouds that are utilized in a unified infrastructure. The distributed cloud has been in production-quality operation for two years with approximately 500,000 completed jobs, where a typical workload has 500 simultaneous embarrassingly parallel jobs that run for approximately 12 hours. We review the design and implementation of the system, which is based on pre-existing components and a number of custom components. We discuss the operation of the system and describe our plans for expansion to more sites and increased computing capacity.
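The figures quoted in the abstract can be turned into a rough scale estimate. The following is a back-of-envelope sketch only: it assumes that every completed job ran for the typical 12 hours and that a year has 365 days, neither of which the abstract states, and the function names are illustrative.

```python
# Figures quoted in the abstract.
COMPLETED_JOBS = 500_000
HOURS_PER_JOB = 12
YEARS_IN_PRODUCTION = 2
TYPICAL_SIMULTANEOUS_JOBS = 500

# Assumption (not from the paper): a year is 365 days of wall-clock time.
HOURS_PER_YEAR = 365 * 24


def total_job_hours(jobs=COMPLETED_JOBS, hours_per_job=HOURS_PER_JOB):
    """Total job-hours, assuming every job ran for the typical duration."""
    return jobs * hours_per_job


def average_concurrent_jobs(jobs=COMPLETED_JOBS,
                            hours_per_job=HOURS_PER_JOB,
                            years=YEARS_IN_PRODUCTION):
    """Mean number of jobs running at once, averaged over the whole period."""
    return total_job_hours(jobs, hours_per_job) / (years * HOURS_PER_YEAR)


def workload_job_hours(simultaneous=TYPICAL_SIMULTANEOUS_JOBS,
                       hours_per_job=HOURS_PER_JOB):
    """Job-hours consumed by one typical workload of simultaneous jobs."""
    return simultaneous * hours_per_job


if __name__ == "__main__":
    print(f"Total job-hours:          {total_job_hours():,}")
    print(f"Average concurrent jobs:  {average_concurrent_jobs():.0f}")
    print(f"Job-hours per workload:   {workload_job_hours():,}")
```

Under these assumptions the system would have delivered about 6 million job-hours, equivalent to roughly 340 jobs running continuously for two years, which is of the same order as the 500 simultaneous jobs cited for a typical workload.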
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Scientific Computing and Data Management
