# AXS: A framework for fast astronomical data processing based on Apache Spark

**Authors:** Petar Zečević, Colin T. Slater, Mario Jurić, Andrew J. Connolly, Sven Lončarić, Eric C. Bellm, V. Zach Golkhou, Krzysztof Suberlak

arXiv: 1905.09034 · 2019-07-10

## TL;DR

AXS is a scalable, open-source framework built on Apache Spark that enables fast, large-scale astronomical data analysis and cross-matching using familiar Python tools, demonstrated on major sky surveys.

## Contribution

The paper introduces AXS, a novel Spark-based framework with efficient distributed cross-matching capabilities tailored for astronomical datasets.

## Key findings

- Performed on-the-fly cross-match of Gaia DR2 and AllWise in ~30 seconds
- Supports querying and analyzing billion-row catalogs efficiently
- Provides a user-friendly Python API for astronomical data analysis

## Abstract

We introduce AXS (Astronomy eXtensions for Spark), a scalable open-source astronomical data analysis framework built on Apache Spark, a widely used industry-standard engine for big data processing. Building on capabilities present in Spark, AXS aims to enable querying and analyzing almost arbitrarily large astronomical catalogs using familiar Python/AstroPy concepts, DataFrame APIs, and SQL statements. We achieve this by i) adding support to Spark for efficient on-line positional cross-matching and ii) supplying a Python library supporting commonly used operations for astronomical data analysis. To support scalable cross-matching, we developed a variant of the ZONES algorithm (Gray et al. 2004) capable of operating in a distributed, shared-nothing architecture. We couple this to a data partitioning scheme that enables fast catalog cross-matching and handles the data skew often present in deep all-sky data sets. The cross-match and other often-used functionalities are exposed to the end users through an easy-to-use Python API. We demonstrate AXS' technical and scientific performance on SDSS, ZTF, Gaia DR2, and AllWise catalogs. Using AXS we were able to perform an on-the-fly cross-match of the Gaia DR2 (1.8 billion rows) and AllWise (900 million rows) data sets in ~30 seconds. We discuss how cloud-ready distributed systems like AXS provide a natural way to enable comprehensive end-user analyses of large datasets such as LSST.
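
The zone-based idea behind the cross-match can be illustrated with a minimal, single-machine sketch in plain Python. This is not the AXS API or the authors' distributed variant: the function names (`zone_of`, `angular_sep_deg`, `crossmatch`), the `(id, ra, dec)` row layout, and the one-arcminute zone height are all assumptions made for illustration. The sketch buckets one catalog into declination zones and, for each object in the other catalog, compares it only against candidates in its own zone and the two adjacent zones.

```python
import math
from collections import defaultdict

# Hypothetical zone height (one arcminute); an assumption of this sketch.
ZONE_HEIGHT_DEG = 1.0 / 60.0


def zone_of(dec_deg, zone_height=ZONE_HEIGHT_DEG):
    """Index of the declination stripe (zone) that contains dec_deg."""
    return int(math.floor((dec_deg + 90.0) / zone_height))


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def crossmatch(left, right, radius_deg, zone_height=ZONE_HEIGHT_DEG):
    """Return (left_id, right_id) pairs separated by at most radius_deg.

    Rows are (id, ra_deg, dec_deg) tuples. The radius must not exceed the
    zone height, so every match lies in the same or an adjacent zone.
    """
    if radius_deg > zone_height:
        raise ValueError("radius must not exceed the zone height")

    # Bucket the right-hand catalog by zone.
    buckets = defaultdict(list)
    for row in right:
        buckets[zone_of(row[2], zone_height)].append(row)

    matches = []
    for lid, lra, ldec in left:
        z = zone_of(ldec, zone_height)
        for nz in (z - 1, z, z + 1):
            for rid, rra, rdec in buckets.get(nz, ()):
                # Cheap declination cut before the exact distance test.
                if abs(rdec - ldec) > radius_deg:
                    continue
                if angular_sep_deg(lra, ldec, rra, rdec) <= radius_deg:
                    matches.append((lid, rid))
    return matches
```

Calling `crossmatch(left, right, 2.0 / 3600.0)` on two small lists of `(id, ra, dec)` tuples would pair objects lying within two arcseconds of each other. In a distributed setting the zone index would typically serve as a join or partitioning key so that candidate pairs are co-located; how AXS arranges that is described in the full paper, not in this sketch.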

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.09034/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1905.09034/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1905.09034/full.md

---
Source: https://tomesphere.com/paper/1905.09034