Scaling pair count to next galaxy surveys

S. Plaszczynski, J.E. Campagne, J. Peloton, and C. Arnault

arXiv:2012.08455·astro-ph.IM·January 4, 2022

TL;DR

This paper introduces an efficient, scalable algorithm using Apache Spark and a novel sphere pixelization method to rapidly compute galaxy pair counts in massive datasets, addressing big data challenges in upcoming cosmological surveys.

Contribution

The paper presents a new sphere pixelization scheme (SARSPix) and an optimized Spark-based algorithm for fast pair counting in billion-scale galaxy datasets, improving computational efficiency.

Findings

01. Achieves pair-distance histogram computation in about 2 minutes for billion-scale data

02. Demonstrates scalability over 16 to 64 nodes in a Spark cluster

03. Provides publicly available software for large-scale galaxy data analysis

Abstract

Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of galaxies, challenging our ability to perform such counting in a minute-scale time relevant for the usage of simulations. The problem is only limited by efficient access to the data, and hence belongs to the big data category. We use the popular Apache Spark framework to address it and design an efficient high-throughput algorithm to deal with hundreds of millions to billions of input data. To optimize it, we revisit the question of nonhierarchical sphere pixelization based on cube symmetries and develop a new one dubbed the "Similar Radius Sphere Pixelization" (SARSPix) with very close to square pixels. It provides the most adapted indexing over the sphere for…
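The pair-distance histogram described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the authors' Spark implementation, and the uniform Cartesian grid used here is only a stand-in for a spatial index, not SARSPix itself. The function names, the 3D Cartesian coordinates, and the choice of cell size (equal to the largest separation of interest) are all assumptions made for illustration. The sketch shows why indexing helps: once points are grouped into cells, only points in the same or adjacent cells need to be compared.

```python
import math
from collections import defaultdict
from itertools import combinations, product


def brute_force_pair_histogram(points, bin_width, n_bins):
    """Count unordered pairs per separation bin by testing every pair (O(N^2))."""
    counts = [0] * n_bins
    for p, q in combinations(points, 2):
        k = int(math.dist(p, q) / bin_width)
        if k < n_bins:
            counts[k] += 1
    return counts


def gridded_pair_histogram(points, bin_width, n_bins):
    """Same histogram, comparing only points in the same or adjacent grid cells.

    Assumption: a uniform cubic grid (hypothetical stand-in for a real
    pixelization) whose cell side equals the largest separation of interest.
    """
    r_max = bin_width * n_bins
    cells = defaultdict(list)
    for p in points:
        cells[tuple(math.floor(c / r_max) for c in p)].append(p)

    counts = [0] * n_bins
    for cell, members in cells.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            # Skip empty cells, and visit each pair of cells only once.
            if neighbour < cell or neighbour not in cells:
                continue
            if neighbour == cell:
                pairs = combinations(members, 2)
            else:
                pairs = product(members, cells[neighbour])
            for p, q in pairs:
                k = int(math.dist(p, q) / bin_width)
                if k < n_bins:
                    counts[k] += 1
    return counts
```

Both functions return the same counts for separations below `bin_width * n_bins`; the gridded version simply avoids testing pairs that are guaranteed to be too far apart. In a distributed setting the cell index would play the role of a join key, which is the kind of role an indexing scheme over the sphere plays in the paper's setting.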
