EntropyDB: A Probabilistic Approach to Approximate Query Processing

Laurel Orr, Magdalena Balazinska, and Dan Suciu

arXiv:1911.04948·cs.DB·November 13, 2019

Open Access

TL;DR

EntropyDB introduces a probabilistic data summarization method based on the Principle of Maximum Entropy. Approximate queries are answered from a small summary rather than from the full data, which makes the approach suited to large datasets. The current work supports only linear queries.

Contribution

The paper presents a probabilistic framework for data summarization based on maximum entropy, and reports faster query answering with comparable or lower error than traditional sampling methods.

Findings

01 Faster query answering than sampling on large datasets

02 Achieves comparable or lower error than sampling

03 Better distinguishes rare and nonexistent values

Abstract

We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can…
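To make the idea of a maximum-entropy summary concrete, here is a minimal sketch in Python. It is not the system described in the paper: it assumes the only statistics kept are one-dimensional histograms, in which case the maximum-entropy distribution is simply the product of the per-attribute marginals and no solver is needed. The function names `build_summary` and `estimate_count` and the tiny flights-like table are hypothetical, chosen only for illustration.

```python
from collections import Counter
from itertools import product


def build_summary(rows):
    """Keep only the row count and one histogram per attribute (assumed statistics)."""
    n = len(rows)
    num_attrs = len(rows[0])
    marginals = [Counter(row[i] for row in rows) for i in range(num_attrs)]
    return n, marginals


def estimate_count(summary, predicate):
    """Estimate COUNT(*) WHERE predicate(tuple) from the summary alone.

    With only 1D marginals as constraints, the maximum-entropy distribution
    is the product of the marginals, so a tuple's probability is the product
    of its attribute frequencies. A counting query is linear: its expected
    answer is n times the total probability of the matching tuples.
    """
    n, marginals = summary
    total = 0.0
    for values in product(*(m.keys() for m in marginals)):
        if predicate(values):
            p = 1.0
            for m, v in zip(marginals, values):
                p *= m[v] / n
            total += p
    return n * total


# Hypothetical toy table: (origin state, delay status).
rows = [
    ("WA", "delayed"), ("WA", "on_time"), ("WA", "on_time"), ("WA", "on_time"),
    ("CA", "delayed"), ("CA", "delayed"), ("CA", "on_time"), ("CA", "on_time"),
]
summary = build_summary(rows)

# Two-attribute query: answered approximately from the summary.
est_wa_delayed = estimate_count(summary, lambda t: t[0] == "WA" and t[1] == "delayed")

# Single-attribute query: matches a stored statistic.
est_wa = estimate_count(summary, lambda t: t[0] == "WA")

# Value that never occurs in the data: no probability mass is assigned to it.
est_or = estimate_count(summary, lambda t: t[0] == "OR")
```

In this sketch, queries that coincide with a stored statistic are reproduced up to floating-point rounding, while queries that combine attributes are approximated under the independence that the 1D-only assumption implies. A summary that also stores multi-dimensional statistics would no longer have this closed form and would require fitting the model parameters numerically.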

Taxonomy

Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Data Visualization and Analytics