Vaex: Big Data exploration in the era of Gaia
Maarten A. Breddels, Jovan Veljanoski

TL;DR
Vaex is a Python library designed for fast, out-of-core exploration and visualization of extremely large tabular datasets, such as astronomical catalogues, enabling analysis of billions of rows efficiently.
Contribution
It introduces a scalable, memory-efficient framework with lazy evaluation and visualization capabilities tailored for big data in astronomy and related fields.
Findings
Supports analysis and visualization at a rate on the order of a billion rows per second
Enables out-of-core processing with memory mapping and streaming algorithms
Provides a Pandas-like API for easy migration and integration
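The memory-mapping and streaming points above can be illustrated with a minimal, standard-library-only sketch. It is not Vaex's implementation or API: the raw float64 column file layout and the names `write_column` and `streaming_mean` are assumptions made for illustration. The sketch maps a non-empty file into memory, views it as doubles without copying, and accumulates a mean one chunk at a time.

```python
import array
import mmap
import os
import tempfile


def write_column(path, values):
    """Write values as a raw native-endian float64 column (hypothetical layout)."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)


def streaming_mean(path, chunk_rows=1024):
    """Mean of a raw float64 column, read through a memory map in chunks.

    Assumes the file is non-empty and its size is a multiple of 8 bytes.
    """
    total = 0.0
    count = 0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Zero-copy: the memoryview reinterprets the mapped bytes as doubles.
            with memoryview(mm) as raw, raw.cast("d") as view:
                for start in range(0, len(view), chunk_rows):
                    # Slicing a memoryview does not copy the underlying data.
                    with view[start:start + chunk_rows] as chunk:
                        total += sum(chunk)
                        count += len(chunk)
    return total / count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "x.bin")
        write_column(path, [float(i) for i in range(10_000)])
        print(streaming_mean(path, chunk_rows=1000))
```

Only one chunk is touched at a time and the operating system pages data in on demand, so the same loop would in principle work on a file larger than RAM.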
Abstract
We present a new Python library called vaex, to handle extremely large tabular datasets, such as astronomical catalogues like the Gaia catalogue, N-body simulations or any other regular datasets which can be structured in rows and columns. Fast computations of statistics on regular N-dimensional grids allow analysis and visualization in the order of a billion rows per second. We use streaming algorithms, memory mapped files and a zero memory copy policy to allow exploration of datasets larger than memory, i.e. out-of-core algorithms. Vaex allows arbitrary (mathematical) transformations using normal Python expressions and (a subset of) numpy functions which are lazily evaluated and computed when needed in small chunks, which avoids wasting RAM. Boolean expressions (which are also lazily evaluated) can be used to explore subsets of the data, which we call selections. Vaex uses a…
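To make the ideas of lazy expressions, selections, and statistics on a regular grid concrete, here is a small pure-Python sketch. It does not use Vaex and does not reflect its actual API; `iter_chunks`, `binned_count`, and the row-by-row evaluation are assumptions chosen for readability. An expression and an optional boolean selection are passed as callables and only evaluated while the data streams past in small chunks, so no transformed column is ever materialized.

```python
def iter_chunks(columns, chunk_rows):
    """Yield the columns as successive slices of at most chunk_rows rows."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, chunk_rows):
        yield {name: col[start:start + chunk_rows] for name, col in columns.items()}


def binned_count(columns, expression, limits, bins, selection=None, chunk_rows=4):
    """Count rows on a regular 1-D grid of `bins` cells spanning `limits`.

    `expression` and `selection` take a row (dict of column values) and are
    evaluated lazily, one chunk at a time.
    """
    lo, hi = limits
    scale = bins / (hi - lo)
    counts = [0] * bins
    for chunk in iter_chunks(columns, chunk_rows):
        rows_in_chunk = len(next(iter(chunk.values())))
        for i in range(rows_in_chunk):
            row = {name: col[i] for name, col in chunk.items()}
            if selection is not None and not selection(row):
                continue  # row is outside the selection
            value = expression(row)
            if lo <= value < hi:
                # min() guards against float rounding at the upper edge.
                counts[min(int((value - lo) * scale), bins - 1)] += 1
    return counts


columns = {
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "y": [0, 1, 0, 1, 0, 1, 0, 1],
}
doubled = lambda row: row["x"] * 2      # a "virtual column", never stored
odd_rows = lambda row: row["y"] == 1    # a boolean selection

print(binned_count(columns, doubled, (0, 16), 4))                      # [2, 2, 2, 2]
print(binned_count(columns, doubled, (0, 16), 4, selection=odd_rows))  # [1, 1, 1, 1]
```

A real implementation would evaluate each chunk with vectorized array operations rather than a Python loop; the loop is kept here only to show the control flow.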
