TL;DR
This paper introduces an efficient, scalable algorithm using Apache Spark and a novel sphere pixelization method to rapidly compute galaxy pair counts in massive datasets, addressing big data challenges in upcoming cosmological surveys.
Contribution
The paper presents a new sphere pixelization scheme (SARSPix) and an optimized Spark-based algorithm for fast pair counting in billion-scale galaxy datasets, improving computational efficiency.
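To make the idea of a cube-based, nonhierarchical sphere pixelization concrete, here is a minimal sketch of the simplest member of that family: project each direction onto the face of a surrounding cube and lay a uniform grid on that face. This is not SARSPix, whose construction is not given in this summary. The function names, the face numbering, and the `nside` parameter are assumptions made for illustration.

```python
def cube_pixel(x, y, z, nside):
    """Map a nonzero 3D direction to (face, i, j) on a cube whose
    six faces each carry an nside x nside uniform grid.

    Illustrative only: plain gnomonic cube projection, NOT SARSPix.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    # The dominant axis selects which of the 6 cube faces the direction hits.
    if ax >= ay and ax >= az:
        face, u, v, w = (0 if x > 0 else 1), y, z, ax
    elif ay >= az:
        face, u, v, w = (2 if y > 0 else 3), x, z, ay
    else:
        face, u, v, w = (4 if z > 0 else 5), x, y, az
    # Gnomonic projection onto the face: both coordinates lie in [-1, 1].
    a, b = u / w, v / w
    # Uniform grid on the face; clamp so a == 1.0 lands in the last cell.
    i = min(int((a + 1.0) / 2.0 * nside), nside - 1)
    j = min(int((b + 1.0) / 2.0 * nside), nside - 1)
    return face, i, j


def pixel_index(x, y, z, nside):
    """Flatten (face, i, j) into a single integer in [0, 6 * nside**2)."""
    face, i, j = cube_pixel(x, y, z, nside)
    return face * nside * nside + i * nside + j
```

A uniform grid under this projection gives pixels whose shape and area vary noticeably across a face, which is the kind of distortion a scheme aiming for "very close to square pixels" would need to reduce.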
Findings
Achieves pair-distance histogram computation in about 2 minutes for billion-scale data
Demonstrates scalability over 16 to 64 nodes in a Spark cluster
Provides publicly available software for large-scale galaxy data analysis
Abstract
Counting pairs of galaxies or stars according to their distance is at the core of real-space correlation analyses performed in astrophysics and cosmology. Upcoming galaxy surveys (LSST, Euclid) will measure properties of billions of galaxies, challenging our ability to perform such counting on the minute-scale time relevant for the usage of simulations. The problem is limited only by efficient access to the data, and hence belongs to the big data category. We use the popular Apache Spark framework to address it and design an efficient high-throughput algorithm to deal with hundreds of millions to billions of input data points. To optimize it, we revisit the question of nonhierarchical sphere pixelization based on cube symmetries and develop a new one dubbed the "Similar Radius Sphere Pixelization" (SARSPix) with very close to square pixels. It provides the most adapted indexing over the sphere for…
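As a point of reference for what "counting pairs according to their distance" computes, the following is a minimal brute-force sketch in pure Python. It is not the paper's Spark algorithm: it visits every pair, so it scales as O(N²) and is only usable on small samples. Treating the distance as an angular separation in degrees, along with the function names and the half-open bin convention, are assumptions made for illustration.

```python
import bisect
import math
from itertools import combinations


def to_unit_vector(ra_deg, dec_deg):
    """Convert (RA, Dec) in degrees to a unit vector on the sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def pair_count_histogram(points, edges_deg):
    """Count unordered pairs per angular-separation bin.

    points    : list of (ra_deg, dec_deg) tuples
    edges_deg : ascending bin edges in degrees; bin k covers
                [edges_deg[k], edges_deg[k + 1])
    Pairs falling outside the edges are ignored.
    """
    vecs = [to_unit_vector(ra, dec) for ra, dec in points]
    counts = [0] * (len(edges_deg) - 1)
    for a, b in combinations(vecs, 2):
        # Clamp the dot product to guard acos against rounding error.
        dot = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
        sep = math.degrees(math.acos(dot))
        k = bisect.bisect_right(edges_deg, sep) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    return counts


if __name__ == "__main__":
    sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (90.0, 0.0)]
    print(pair_count_histogram(sample, [0.0, 1.5, 3.0, 100.0]))
```

A scalable version would avoid the all-pairs loop, for instance by indexing points into sphere pixels and comparing only points in nearby pixels, which is where a pixelization scheme enters the picture.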
