BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data

Sameer Agarwal; Aurojit Panda; Barzan Mozafari; Samuel Madden; Ion Stoica

arXiv:1203.5485 · cs.DB · June 20, 2012 · 45 cites

Open Access

TL;DR

BlinkDB is an approximate query engine that enables interactive SQL queries on large datasets by trading off accuracy for response time, using adaptive sampling and dynamic sample selection.

Contribution

BlinkDB introduces an adaptive optimization framework that builds and maintains multi-dimensional, multi-resolution samples, together with a dynamic sample selection strategy, for efficient approximate querying on massive data.
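To make the idea of choosing a sample by accuracy requirement concrete, here is a minimal Python sketch. It is not BlinkDB's actual selection logic: it assumes a simple mean query, a known standard deviation, the normal approximation (z = 1.96 for 95% confidence), and a hypothetical set of pre-built sample sizes. The function names `required_sample_size` and `pick_sample` are illustrative only.

```python
import math

def required_sample_size(std_dev, max_margin, z=1.96):
    """Rows needed so that z * std_dev / sqrt(n) <= max_margin."""
    return math.ceil((z * std_dev / max_margin) ** 2)

def pick_sample(available_sizes, std_dev, max_margin, z=1.96):
    """Return the smallest pre-built sample size that meets the error
    target, or None if no available sample is large enough."""
    needed = required_sample_size(std_dev, max_margin, z)
    candidates = [n for n in sorted(available_sizes) if n >= needed]
    return candidates[0] if candidates else None

# Hypothetical sample resolutions (row counts) and query requirement:
# a mean with std. dev. 120 estimated to within +/- 2 units.
sizes = [1_000, 10_000, 100_000, 1_000_000]
chosen = pick_sample(sizes, std_dev=120.0, max_margin=2.0)
print(chosen)
```

Picking the smallest sufficient sample is what ties the accuracy target to response time in this sketch: a looser error bound lets the query run on fewer rows.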

Findings

01. Answers queries on 17 TB of data in less than 2 seconds

02. Achieves over 100x speedup compared to Hive

03. Returns results within an error of 2-10%

Abstract

In this paper, we present BlinkDB, a massively parallel, sampling-based approximate query engine for running ad-hoc, interactive SQL queries on large volumes of data. The key insight that BlinkDB builds on is that one can often make reasonable decisions in the absence of perfect answers. For example, reliably detecting a malfunctioning server using a distributed collection of system logs does not require analyzing every request processed by the system. Based on this insight, BlinkDB allows one to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas that differentiate it from previous work in this area: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional, multi-resolution…
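The core trade-off in the abstract, running a query on a sample and reporting the answer with an error bar, can be illustrated with a small Python sketch. This is an assumption-laden illustration, not BlinkDB's implementation: it uses a uniform random sample, a mean aggregate, a normal-approximation 95% confidence interval, and synthetic latency data. The helper `approximate_mean` is a hypothetical name.

```python
import math
import random
import statistics

def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, margin), where estimate +/- margin is an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err

# Synthetic stand-in for a large table column: request latencies in ms.
data_rng = random.Random(42)
latencies = [data_rng.expovariate(1 / 120.0) for _ in range(200_000)]

# Scan 5% of the rows instead of all of them.
estimate, margin = approximate_mean(latencies, sample_size=10_000)
print(f"mean latency: {estimate:.1f} ms +/- {margin:.1f} ms (95% confidence)")
```

The margin shrinks roughly with the square root of the sample size, so quartering the error bar costs about sixteen times as many rows. That relationship is why accepting a small error can buy a large reduction in response time.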

Taxonomy

Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Cloud Computing and Resource Management