TL;DR
The paper introduces a scalable, cloud-based astronomy analysis platform built on AWS that processes terabyte-scale tabular datasets next to the data through a familiar JupyterHub front-end, and demonstrates it on the Zwicky Transient Facility's light-curve catalog.
Contribution
It presents a deployable, scalable cloud platform for astronomical data analysis that integrates Apache Spark and JupyterHub, tailored for astronomers with moderate cloud expertise.
Findings
Successfully scaled to handle terabyte datasets
Enabled iterative analysis with transparent scaling
Deployed for multiple astronomical collaborations
Abstract
We present a scalable, cloud-based science platform solution designed to enable next-to-the-data analyses of terabyte-scale astronomical tabular datasets. The presented platform is built on Amazon Web Services (over Kubernetes and S3 abstraction layers), utilizes Apache Spark and the Astronomy eXtensions for Spark for parallel data analysis and manipulation, and provides the familiar JupyterHub web-accessible front-end for user access. We outline the architecture of the analysis platform, provide implementation details, rationale for (and against) technology choices, verify scalability through strong and weak scaling tests, and demonstrate usability through an example science analysis of data from the Zwicky Transient Facility's 1Bn+ light-curve catalog. Furthermore, we show how this system enables an end-user to iteratively build analyses (in Python) that transparently scale processing…
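The abstract mentions strong and weak scaling tests without showing how they are scored. A minimal sketch follows, using the textbook definitions of scaling efficiency; the function names and the timing numbers are hypothetical and are not taken from the paper.

```python
def strong_scaling_efficiency(t_one: float, t_n: float, n_workers: int) -> float:
    """Fixed total problem size: ideal runtime on n workers is t_one / n."""
    if t_n <= 0 or n_workers <= 0:
        raise ValueError("runtime and worker count must be positive")
    return t_one / (n_workers * t_n)


def weak_scaling_efficiency(t_one: float, t_n: float) -> float:
    """Problem size grows with worker count: ideal runtime stays at t_one."""
    if t_n <= 0:
        raise ValueError("runtime must be positive")
    return t_one / t_n


# Hypothetical wall-clock timings in seconds, for illustration only.
strong = strong_scaling_efficiency(t_one=100.0, t_n=25.0, n_workers=4)
weak = weak_scaling_efficiency(t_one=100.0, t_n=125.0)
print(f"strong scaling efficiency: {strong:.2f}")
print(f"weak scaling efficiency: {weak:.2f}")
```

An efficiency near 1.0 means the platform scales close to ideally. In a strong scaling test, values below 1.0 show overhead when more executors share a fixed dataset; in a weak scaling test, they show overhead when the dataset grows in step with the executors.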
