Data management and execution systems for the Rubin Observatory Science Pipelines

Nate B. Lust (1), Tim Jenness (2), James F. Bosch (1), Andrei Salnikov (3), Nathan M. Pease (3), Michelle Gower (4), Mikolaj Kowalik (4), Gregory P. Dubois-Felsmann (5), Fritz Mueller (3), Pim Schellart (1) ((1) Princeton University, (2) Vera C. Rubin Observatory Project Office, (3) SLAC National Accelerator Laboratory, (4) NCSA, (5) IPAC)

arXiv:2303.03313·astro-ph.IM·March 7, 2023·1 cites

Open Access

TL;DR

The paper describes the Rubin Observatory's data management and execution system, including the Butler storage layer, metadata registry, and scalable pipeline execution infrastructure, enabling efficient large-scale data processing.

Contribution

Introduces a data management and pipeline execution system that hides storage and bookkeeping details from algorithm code and scales from laptops to data centers.

Findings

01 Effective data abstraction layer simplifies algorithm development.

02 Scalable infrastructure supports diverse computing environments.

03 Automated pipeline creation enhances processing efficiency.

Abstract

We present the Rubin Observatory system for data storage/retrieval and pipelined code execution. The layer for data storage and retrieval is named the Butler. It consists of a relational database, known as the registry, to keep track of metadata and relations, and a system to manage where the data is located, named the datastore. Together these systems create an abstraction layer that science algorithms can be written against. This abstraction layer manages the complexities of the large data volumes expected and allows algorithms to be written independently, yet be tied together automatically into a coherent processing pipeline. This system consists of tools which execute these pipelines by transforming them into execution graphs which contain concrete data stored in the Butler. The pipeline infrastructure is designed to be scalable in nature, allowing execution on environments ranging…
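The registry/datastore split and the step from a pipeline to an execution graph can be illustrated with a toy model. The sketch below is not the Rubin Butler API: `ToyButler`, `Task`, `build_graph`, and `execute` are hypothetical names invented here, the "datastore" is an in-memory dict, and the dataset type names and the `visit` key are placeholders. It also assumes each task has one input and one output sharing the same data ID, and that tasks are listed upstream first.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Callable


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical handle: a dataset type plus a data ID (sorted key/value pairs)."""
    dataset_type: str
    data_id: tuple


def make_ref(dataset_type, **data_id):
    return DatasetRef(dataset_type, tuple(sorted(data_id.items())))


class ToyButler:
    """Toy stand-in: the registry maps refs to storage keys, the datastore holds payloads."""

    def __init__(self):
        self.registry = {}   # DatasetRef -> storage key (metadata and relations)
        self.datastore = {}  # storage key -> stored object (where the data lives)

    def put(self, obj, dataset_type, **data_id):
        ref = make_ref(dataset_type, **data_id)
        key = f"{dataset_type}/" + "_".join(f"{k}{v}" for k, v in ref.data_id)
        self.datastore[key] = obj
        self.registry[ref] = key
        return ref

    def get(self, dataset_type, **data_id):
        # Callers name *what* they want; the location is resolved internally.
        return self.datastore[self.registry[make_ref(dataset_type, **data_id)]]

    def query(self, dataset_type):
        return [r for r in self.registry if r.dataset_type == dataset_type]


@dataclass(frozen=True)
class Task:
    """Hypothetical task: declares the dataset type it reads and the one it writes."""
    name: str
    input_type: str
    output_type: str
    run: Callable


def build_graph(butler, tasks):
    """Expand tasks into quanta (task name, data ID) and map each to its upstream quanta."""
    producers = {}  # (dataset_type, data_id) -> quantum predicted to write it
    graph = {}      # quantum -> set of quanta it depends on
    for task in tasks:  # assumption: tasks are listed upstream first
        data_ids = {r.data_id for r in butler.query(task.input_type)}
        data_ids |= {d for (t, d) in producers if t == task.input_type}
        for data_id in sorted(data_ids):
            quantum = (task.name, data_id)
            upstream = producers.get((task.input_type, data_id))
            graph[quantum] = {upstream} if upstream else set()
            producers[(task.output_type, data_id)] = quantum
    return graph


def execute(butler, tasks, graph):
    """Run every quantum in dependency order, reading and writing through the butler."""
    by_name = {t.name: t for t in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name, data_id in order:
        task = by_name[name]
        value = task.run(butler.get(task.input_type, **dict(data_id)))
        butler.put(value, task.output_type, **dict(data_id))
    return order


butler = ToyButler()
butler.put(10, "raw", visit=1)
butler.put(20, "raw", visit=2)
tasks = [
    Task("calibrate", "raw", "calexp", lambda x: x + 1),
    Task("measure", "calexp", "catalog", lambda x: x * 2),
]
graph = build_graph(butler, tasks)
order = execute(butler, tasks, graph)
print(butler.get("catalog", visit=1))
```

Neither task refers to the other or to a storage location. The link between them comes only from the declared dataset types, which is the sense in which independently written algorithms can be tied together automatically.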

Taxonomy

Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Advanced Data Storage Technologies