Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Kexin Rong; Yao Lu; Peter Bailis; Srikanth Kandula; Philip Levis

arXiv:2008.10569·cs.DB·August 25, 2020

TL;DR

This paper speeds up approximate query processing on big-data clusters by using pre-computed summary statistics to choose which data partitions to read, so that a query touches only a subset of the partitions and the data layout stays unchanged.

Contribution

It proposes using pre-computed summary statistics to assess the similarity and importance of partitions, and uses those assessments to decide which partitions to read and how to weight their partial answers in an approximate query.

Findings

01 Reads 2.7x to 70x fewer partitions than uniform partition sampling at the same relative error

02 Uses less than 100KB of statistics per partition

03 Demonstrates effectiveness across several datasets and data layouts

Abstract

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner without modifying the data layout. We illustrate how to efficiently perform this query processing using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that to achieve the same relative error compared to uniform partition sampling, our techniques offer from $2.7\times$ to $70\times$ …
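The idea of reading a subset of partitions and combining partial answers in a weighted manner can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it uses a textbook Hansen-Hurwitz estimator, and the names `selection_probs` and `estimate_sum`, the toy data, and the choice of per-partition sums as the "summary statistic" are all assumptions made for illustration.

```python
import random


def selection_probs(stats):
    """Turn positive per-partition statistics into selection probabilities."""
    total = float(sum(stats))
    return [s / total for s in stats]


def estimate_sum(partitions, probs, n_reads, seed=0):
    """Estimate SUM over all rows while reading only n_reads sampled partitions.

    Partitions are drawn with replacement according to probs, and each partial
    sum is weighted by 1 / probs[i] (Hansen-Hurwitz estimator), which keeps the
    estimate unbiased for any strictly positive probs.
    """
    rng = random.Random(seed)
    picks = rng.choices(range(len(partitions)), weights=probs, k=n_reads)
    return sum(sum(partitions[i]) / probs[i] for i in picks) / n_reads


# Toy data (assumed): three partitions whose contents differ widely in magnitude.
partitions = [[1, 2, 3], [10, 20], [100]]
true_total = sum(sum(p) for p in partitions)

# Baseline: uniform partition sampling, every partition equally likely.
uniform = [1.0 / len(partitions)] * len(partitions)

# Hypothetical summary statistic: a pre-computed sum for each partition.
informed = selection_probs([sum(p) for p in partitions])

est_uniform = estimate_sum(partitions, uniform, n_reads=2)
est_informed = estimate_sum(partitions, informed, n_reads=2)
```

In this idealised case the statistic is exactly proportional to each partition's contribution, so every weighted partial answer equals the true total and the informed estimate has no sampling error, whereas the uniform estimate depends on which partitions happen to be drawn. Real summary statistics would only approximate a partition's contribution to a given query, so the gain would be smaller.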
