HTC Scientific Computing in a Distributed Cloud Environment

R. Sobie; A. Agarwal; I. Gable; C. Leavett-Brown; M. Paterson; R. Taylor; A. Charbonneau; R. Impey; W. Podiama

arXiv:1302.1939 · cs.DC · February 11, 2013

TL;DR

This paper discusses the design, implementation, and operation of a distributed cloud system for high-throughput scientific computing, successfully managing hundreds of thousands of jobs across multiple IaaS clouds.

Contribution

It presents a unified distributed cloud infrastructure for HTC applications, combining existing and custom components, with operational insights and expansion plans.

Findings

01. The system completed approximately 500,000 jobs over two years of production operation.

02. A typical workload is about 500 simultaneous embarrassingly parallel jobs, each running for approximately 12 hours.

03. The system demonstrates scalable, production-quality distributed cloud computing.

Abstract

This paper describes the use of a distributed cloud computing system for high-throughput computing (HTC) scientific applications. The distributed cloud computing system is composed of a number of separate Infrastructure-as-a-Service (IaaS) clouds that are utilized in a unified infrastructure. The distributed cloud has been in production-quality operation for two years with approximately 500,000 completed jobs, where a typical workload has 500 simultaneous embarrassingly-parallel jobs that run for approximately 12 hours. We review the design and implementation of the system, which is based on pre-existing components and a number of custom components. We discuss the operation of the system and describe our plans for the expansion to more sites and increased computing capacity.
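To put the quoted figures in proportion, the following is a rough back-of-envelope sketch, not a result from the paper. It assumes that every job runs for the full 12 hours, that each job occupies a single job slot, and that a year has 365 days; the function and constant names are illustrative only.

```python
# Back-of-envelope estimate from the figures quoted in the abstract.
# Assumptions (not from the paper): every job runs the full 12 hours,
# each job occupies one job slot, and a year has 365 days.

TOTAL_JOBS = 500_000        # completed jobs (abstract)
HOURS_PER_JOB = 12          # typical job duration (abstract)
CONCURRENT_JOBS = 500       # typical simultaneous jobs (abstract)
YEARS = 2                   # production period (abstract)


def total_slot_hours(jobs: int, hours_per_job: float) -> float:
    """Total slot-hours consumed if every job runs for hours_per_job."""
    return jobs * hours_per_job


def average_busy_slots(slot_hours: float, years: float) -> float:
    """Average number of slots busy over the whole period."""
    wall_hours = years * 365 * 24
    return slot_hours / wall_hours


slot_hours = total_slot_hours(TOTAL_JOBS, HOURS_PER_JOB)
avg_slots = average_busy_slots(slot_hours, YEARS)
# Fraction of the period a full 500-job workload would need to be running.
utilization = avg_slots / CONCURRENT_JOBS

print(f"slot-hours: {slot_hours:,.0f}")
print(f"average busy slots: {avg_slots:.0f}")
print(f"fraction of a 500-job workload: {utilization:.2f}")
```

Under these assumptions the totals come to about 6 million slot-hours, or an average of roughly 342 busy slots across the two years, which is consistent with a 500-job workload running most, but not all, of the time.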

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Scientific Computing and Data Management