EntropyDB: A Probabilistic Approach to Approximate Query Processing
Laurel Orr, Magdalena Balazinska, and Dan Suciu

TL;DR
EntropyDB summarizes a dataset as a probabilistic model built with the Principle of Maximum Entropy and answers approximate queries from that summary instead of from the data. It targets large datasets and currently supports only linear queries.
Contribution
The paper presents a probabilistic framework for data summarization based on maximum entropy. Compared with traditional sampling, it answers queries faster on large datasets with comparable or lower error.
Findings
Answers queries faster than sampling on large datasets
Achieves error comparable to or lower than sampling
Better distinguishes between rare and nonexistent values
Abstract
We present EntropyDB, an interactive data exploration system that uses a probabilistic approach to generate a small, query-able summary of a dataset. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques, give two critical optimizations to improve preprocessing time and query execution time, and explore methods to reduce query error. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can…
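To make the idea in the abstract concrete, here is a minimal sketch of a maximum-entropy summary answering a linear (counting) query. It is not the authors' solver or API: the function names `fit_maxent` and `approx_count`, the toy table, and the choice to store only per-attribute marginals are all assumptions made for illustration. Under that assumption the fit is done with iterative proportional fitting, and the maximum-entropy distribution reduces to the product of the marginals.

```python
import numpy as np

def fit_maxent(marg_a, marg_b, n_iter=100):
    """Fit a joint distribution over (A, B) that matches two 1D marginals.

    Hypothetical helper: starts from the uniform (maximum-entropy)
    distribution and rescales rows and columns until both marginals match
    (iterative proportional fitting).
    """
    marg_a = np.asarray(marg_a, dtype=float)
    marg_b = np.asarray(marg_b, dtype=float)
    p = np.full((marg_a.size, marg_b.size), 1.0 / (marg_a.size * marg_b.size))
    for _ in range(n_iter):
        row = p.sum(axis=1)
        p *= np.divide(marg_a, row, out=np.zeros_like(row), where=row > 0)[:, None]
        col = p.sum(axis=0)
        p *= np.divide(marg_b, col, out=np.zeros_like(col), where=col > 0)[None, :]
    return p

def approx_count(p, n, mask):
    """Estimate a counting query as n * P(tuple satisfies the predicate)."""
    return n * p[mask].sum()

# Toy table with two attributes: A in {0, 1, 2}, B in {0, 1}.
rows = np.array([[0, 0], [0, 0], [0, 1], [1, 1],
                 [1, 1], [1, 1], [2, 0], [2, 1]])
n = len(rows)

# The "summary" keeps only these statistics, not the rows themselves.
marg_a = np.bincount(rows[:, 0], minlength=3) / n
marg_b = np.bincount(rows[:, 1], minlength=2) / n
p = fit_maxent(marg_a, marg_b)

# Query: SELECT COUNT(*) WHERE A = 1 AND B = 1
mask = np.zeros((3, 2), dtype=bool)
mask[1, 1] = True
estimate = approx_count(p, n, mask)
exact = int(((rows[:, 0] == 1) & (rows[:, 1] == 1)).sum())
print(f"estimate={estimate:.3f}, exact={exact}")
```

On this toy table the estimate is 1.875 against an exact count of 3. The gap comes from the correlation between A and B, which per-attribute marginals cannot capture; storing additional statistics in the summary is one way such error could be reduced.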
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Data Visualization and Analytics
