BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
Sameer Agarwal, Aurojit Panda, Barzan Mozafari, Samuel Madden, Ion Stoica

TL;DR
BlinkDB is an approximate query engine that enables interactive SQL queries on large datasets by trading off accuracy for response time, using adaptive sampling and dynamic sample selection.
Contribution
It introduces a novel adaptive optimization framework and a dynamic sample selection strategy for efficient approximate querying on massive data.
Findings
Answers queries on up to 17 TB of data in less than 2 seconds
Achieves over 200x speedup compared to Hive
Provides results with 2-10% error bounds
Abstract
In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade-off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution…
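The idea of running a query on a data sample and annotating the result with error bars can be illustrated with a minimal sketch. This is not BlinkDB's implementation: it assumes a single uniform random sample, a normal-approximation confidence interval, and synthetic latency data, and it leaves out the sample-building and sample-selection machinery the abstract describes. The function name `approximate_mean` and all parameters are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin). With z = 1.96, estimate +/- margin is an
    approximate 95% confidence interval under a normal approximation.
    Assumption: the sample is small relative to the data, so no finite
    population correction is applied.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # uniform, without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic request latencies (ms) stand in for a large table.
data_rng = random.Random(42)
latencies = [data_rng.expovariate(1 / 120.0) for _ in range(200_000)]

# Query 1% of the rows instead of scanning all of them.
estimate, margin = approximate_mean(latencies, sample_size=2_000)
exact = statistics.fmean(latencies)

print(f"approximate mean: {estimate:.1f} +/- {margin:.1f} ms")
print(f"exact mean:       {exact:.1f} ms")
```

The margin shrinks roughly with the square root of the sample size, which is the accuracy-versus-response-time trade-off in miniature: a larger sample takes longer to scan but yields tighter error bars.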
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Cloud Computing and Resource Management
