TL;DR
This paper introduces an extensible data skipping framework for SQL queries that supports custom metadata types and indexes, enabling significant speedups across diverse data types and UDFs with minimal development effort.
Contribution
It presents the first framework to natively support data skipping for arbitrary data types and for queries with UDFs, through user-defined metadata types and indexes. The framework is integrated with Apache Spark.
Findings
Achieves up to two orders of magnitude speedup in geospatial queries.
Provides a 3.6x speedup over Parquet min/max metadata-based query rewriting.
Requires only around 30 lines of additional code per new metadata index.
Abstract
Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g., geospatial, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark, and it is now deployed across multiple products and services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, requiring only around 30 lines of additional code per index. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when…
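The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's API or its Spark integration: the `ObjectMeta` class, the index keys, and the function names are all hypothetical, chosen only to show how per-object metadata lets a planner rule out files before reading them, and how a second index type (here a bounding box, as one might use for a geospatial predicate) slots in beside min/max.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical per-object metadata record: one entry per data object (file).
# Keys are (index_type, column); values are whatever that index type stores.
@dataclass
class ObjectMeta:
    path: str
    indexes: Dict[Tuple[str, str], Any] = field(default_factory=dict)

def minmax_may_exceed(meta: ObjectMeta, column: str, value: float) -> bool:
    """False only if no row in the object can satisfy `column > value`."""
    _lo, hi = meta.indexes[("minmax", column)]
    return hi > value

def bbox_may_intersect(meta: ObjectMeta, column: str,
                       query_box: Tuple[float, float, float, float]) -> bool:
    """False only if the object's bounding box is disjoint from the query box.

    Boxes are (min_x, min_y, max_x, max_y). This stands in for a
    user-defined index that serves a geospatial UDF predicate.
    """
    min_x, min_y, max_x, max_y = meta.indexes[("bbox", column)]
    q_min_x, q_min_y, q_max_x, q_max_y = query_box
    return not (max_x < q_min_x or q_max_x < min_x or
                max_y < q_min_y or q_max_y < min_y)

def select_objects(metas: List[ObjectMeta],
                   may_match: Callable[[ObjectMeta], bool]) -> List[str]:
    """Keep only objects whose metadata cannot rule the predicate out."""
    return [m.path for m in metas if may_match(m)]

# Illustrative metadata for three objects (made-up values).
metas = [
    ObjectMeta("part-0.parquet", {("minmax", "temp"): (10, 25),
                                  ("bbox", "loc"): (0, 0, 10, 10)}),
    ObjectMeta("part-1.parquet", {("minmax", "temp"): (26, 40),
                                  ("bbox", "loc"): (20, 20, 30, 30)}),
    ObjectMeta("part-2.parquet", {("minmax", "temp"): (5, 30),
                                  ("bbox", "loc"): (5, 5, 15, 15)}),
]

# WHERE temp > 28: part-0 is skipped because its max (25) cannot exceed 28.
hot = select_objects(metas, lambda m: minmax_may_exceed(m, "temp", 28))

# Geospatial-style predicate: only part-2's box overlaps the query box.
nearby = select_objects(
    metas, lambda m: bbox_may_intersect(m, "loc", (12, 12, 18, 18)))

print(hot)     # ['part-1.parquet', 'part-2.parquet']
print(nearby)  # ['part-2.parquet']
```

Both checks are conservative: they may keep an object that turns out to hold no matching rows, but they never discard one that could. That property is what makes skipping safe to apply before the query runs.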
