Incremental Query Processing on Big Data Streams

Leonidas Fegaras

arXiv:1511.07846 · cs.DB · August 23, 2016


TL;DR

This paper presents a method for automatically converting SQL-like queries into incremental programs for large-scale data streams. The generated programs return exact rather than approximate results at each time interval while retaining only minimal state on a distributed stream processing engine.

Contribution

The paper introduces an approach that generates incremental programs for complex SQL-like queries on big data streams. Unlike approaches that return approximate answers, these programs return exact results.

Findings

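01 The framework incrementally processes iterative and nested queries, group-by with aggregation, and equi-joins on nested data collections, returning exact results.

02 A prototype implementation, MRQL Streaming, runs on top of Spark and demonstrates practical efficiency.

03 Experimental evaluation of the prototype confirms the effectiveness of the approach.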

Abstract

This paper addresses online query processing for large-scale, incremental data analysis on a distributed stream processing engine (DSPE). Our goal is to convert any SQL-like query to an incremental DSPE program automatically. In contrast to other approaches, we derive incremental programs that return accurate results, not approximate answers. This is accomplished by retaining a minimal state during the query evaluation lifetime and by using incremental evaluation techniques to return an accurate snapshot answer at each time interval that depends on the current state and the latest batches of data. Our methods can handle many forms of queries on nested data collections, including iterative and nested queries, group-by with aggregation, and equi-joins. Finally, we report on a prototype implementation of our framework, called MRQL Streaming, running on top of Spark and we experimentally…
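The idea of keeping a small state and combining it with each new batch to produce an exact snapshot answer can be illustrated with a group-by average. The sketch below is only an illustration in plain Python, not the paper's algorithm or MRQL Streaming code; the names `IncrementalGroupAvg`, `update`, and `batch_group_avg` are hypothetical. It assumes the state per key is a (sum, count) pair, which is enough to recover the exact average without storing past records.

```python
from collections import defaultdict


class IncrementalGroupAvg:
    """Hypothetical helper: exact per-key average over a stream of batches.

    The retained state is one (sum, count) pair per key, not the raw data.
    """

    def __init__(self):
        self.state = defaultdict(lambda: (0.0, 0))

    def update(self, batch):
        """Merge a batch of (key, value) pairs into the state.

        Returns the snapshot answer: the exact average per key over
        all data seen so far.
        """
        # Aggregate the new batch on its own first.
        delta = defaultdict(lambda: (0.0, 0))
        for key, value in batch:
            s, c = delta[key]
            delta[key] = (s + value, c + 1)

        # Merge the batch aggregate into the retained state.
        for key, (s, c) in delta.items():
            prev_s, prev_c = self.state[key]
            self.state[key] = (prev_s + s, prev_c + c)

        # Derive the answer from the state alone.
        return {key: s / c for key, (s, c) in self.state.items()}


def batch_group_avg(records):
    """Non-incremental reference: recompute the averages from all records."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


query = IncrementalGroupAvg()
print(query.update([("a", 1.0), ("b", 4.0)]))  # {'a': 1.0, 'b': 4.0}
print(query.update([("a", 3.0), ("c", 5.0)]))  # {'a': 2.0, 'b': 4.0, 'c': 5.0}
```

Each snapshot matches what a full recomputation over all batches seen so far would return, which is the sense in which the answer is exact rather than approximate. The state grows with the number of distinct keys, not with the length of the stream.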
