Analyzing billion-objects catalog interactively: Apache Spark for physicists
S. Plaszczynski, J. Peloton, C. Arnault, J.E. Campagne

TL;DR
This paper demonstrates that Apache Spark can be effectively used by astronomers and cosmologists to perform interactive, large-scale data analyses on billions of objects, enabling efficient processing of datasets from future galaxy surveys.
Contribution
It shows how astronomers and cosmologists without extensive programming skills can apply Spark in practice to analyze large datasets, with optimized algorithms benchmarked on a realistic cosmological simulation.
Findings
Most commands run within tens of seconds on a 110 GB dataset
Algorithms enable interactive analysis of billions of objects
The approach is suitable for standard cosmological data processing
Abstract
Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, its use in the academic community remains rather limited or is often restricted to software engineers. The goal of this paper is to show, with practical use cases, that the technology is mature enough to be used without excessive programming skills by astronomers or cosmologists to perform standard analyses over large datasets, such as those originating from future galaxy surveys. To demonstrate this, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billion galaxies). Then, we design, optimize and benchmark a set of Spark Python algorithms to perform standard operations such as adding photometric redshift errors, measuring the selection function, or computing power spectra over tomographic bins. Most of the commands execute on the…
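As a rough illustration of two of the operations named in the abstract, the sketch below adds a Gaussian photometric redshift error to each galaxy and then counts galaxies per redshift shell, a simple histogram stand-in for a selection function. It is plain Python rather than Spark so the logic can be checked without a cluster, and it is not the authors' implementation. The scatter model sigma0 * (1 + z), the value 0.03, the shell width 0.1, the uniform toy catalog, and all function names are assumptions made for this example. In PySpark the smearing step would more likely be a column expression built from `pyspark.sql.functions.randn`.

```python
import random
from collections import Counter

SIGMA0 = 0.03  # assumed photo-z scatter coefficient, not taken from the paper


def add_photoz_error(z_true, rng, sigma0=SIGMA0):
    """Return z_true plus Gaussian noise of width sigma0 * (1 + z_true)."""
    return z_true + rng.gauss(0.0, sigma0 * (1.0 + z_true))


def tomographic_bin(z_obs, z_min=0.0, width=0.1):
    """Index k of the shell [z_min + k*width, z_min + (k+1)*width) containing z_obs."""
    return int((z_obs - z_min) // width)


def count_per_bin(redshifts, z_min=0.0, width=0.1):
    """Number of galaxies falling in each redshift shell."""
    return Counter(tomographic_bin(z, z_min, width) for z in redshifts)


# Toy catalog: 10,000 true redshifts drawn uniformly (hypothetical data).
rng = random.Random(42)
z_true = [rng.uniform(0.0, 2.0) for _ in range(10_000)]
z_obs = [add_photoz_error(z, rng) for z in z_true]

counts = count_per_bin(z_obs)
for k in sorted(counts):
    print(k, counts[k])
```

Because the noise is applied row by row and the counting is a group-by on the shell index, both steps map naturally onto distributed column operations and aggregations.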
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Database Systems and Queries · Advanced Data Storage Technologies
