Incremental Techniques for Large-Scale Dynamic Query Processing

Iman Elghandour; Ahmet Kara; Dan Olteanu; Stijn Vansummeren

arXiv:1902.00585·cs.DB·February 5, 2019

Incremental Techniques for Large-Scale Dynamic Query Processing

Iman Elghandour, Ahmet Kara, Dan Olteanu, Stijn Vansummeren

PDF

Open Access

TL;DR

This paper reviews incremental query processing techniques for large-scale, real-time big data, highlighting recent algorithms that improve performance and address challenges in high-frequency updates and distributed streaming platforms.

Contribution

It provides a comprehensive overview of legacy and new algorithms for incremental query processing in big data environments, emphasizing recent advances and future directions.

Findings

01

New algorithms reduce processing complexity for high-frequency updates

02

Distributed streaming platforms like Spark Streaming and Flink are leveraged

03

Analysis of various approaches' characteristics and performance

Abstract

Many applications from various disciplines are now required to analyze fast evolving big data in real time. Various approaches for incremental processing of queries have been proposed over the years. Traditional approaches rely on updating the results of a query when updates are streamed rather than re-computing these queries, and therefore, higher execution performance is expected. However, they do not perform well for large databases that are updated at high frequencies. Therefore, new algorithms and approaches have been proposed in the literature to address these challenges by, for instance, reducing the complexity of processing updates. Moreover, many of these algorithms are now leveraging distributed streaming platforms such as Spark Streaming and Flink. In this tutorial, we briefly discuss legacy approaches for incremental query processing, and then give an overview of the new…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsData Management and Algorithms · Cloud Computing and Resource Management · Advanced Database Systems and Queries

Full text

Incremental Techniques for Large-Scale Dynamic Query Processing

Iman Elghandour1, Ahmet Kara2, Dan Olteanu2, Stijn Vansummeren1

1Université Libre de Bruxelles 2University of Oxford

Abstract

Many applications from various disciplines are now required to analyze fast evolving big data in real time. Various approaches for incremental processing of queries have been proposed over the years. Traditional approaches rely on updating the results of a query when updates are streamed rather than re-computing these queries, and therefore, higher execution performance is expected. However, they do not perform well for large databases that are updated at high frequencies. Therefore, new algorithms and approaches have been proposed in the literature to address these challenges by, for instance, reducing the complexity of processing updates. Moreover, many of these algorithms are now leveraging distributed streaming platforms such as Spark Streaming and Flink. In this tutorial, we briefly discuss legacy approaches for incremental query processing, and then give an overview of the new challenges introduced due to processing big data streams. We then discuss in detail the recently proposed algorithms that address some of these challenges. We emphasize the characteristics and algorithmic analysis of various proposed approaches and conclude by discussing future research directions.

1 Introduction

In a broad range of domains, such as Real Time Business Intelligence and Complex Event Processing, contemporary applications require the timely dynamic processing of complex analytical queries on continuously arriving data. Here, dynamic processing refers to updating the query result, preferably in real-time, when the underlying data is updated. Implementing such applications remains a difficult task, and involves resolving two orthogonal challenges:

•

Designing a suitable dynamic query processing algorithm that determines how the application’s query results are to be updated upon data changes, taking into account that previous results are already available and re-computation should be avoided to ensure timeliness.

•

Designing an implementation and deployment of the selected dynamic query processing algorithm that accounts for desiderata such as high throughput, low latency, and the ability to process large data sets. Current approaches mostly rely on distributed computing frameworks such as MapReduce, Flink, Spark, or Storm, to achieve this.

Fortunately, in recent years, there has been a flurry of research on both challenges that provide novel insights in how to resolve them. We briefly survey these next.

Algorithmical insights. Avoiding the re-computation whenever an update is received has long been approached using Incremental View Maintenance (IVM) techniques [7, 10]. IVM materializes the output of a query and then maintains that output under updates. Unfortunately, traditional IVM is not efficient for large databases that are updated at a high frequency. Therefore, new approaches have recently been proposed, whose objective is to reduce the complexity of processing updates and/or to reduce the required memory footprint.

Specifically, research in dynamic query processing has recently received a big boost with: (1) the introduction of Higher-Order IVM (HIVM) [15, 20, 14]; (2) the identification of lower bounds and worst-case optimal algorithms for processing updates [4, 5, 13]; (3) the practical formulations of worst-case-optimal IVM that implement and extend these algorithms [11, 12, 21]; and (4) the introduction of the notion of differential dataflow for computations that require recursive or iterative processing [18, 17]. These approaches often rely on materializing a succinct representation of a query’s output to maintain it more efficiently, and therefore present a fundamental breakthrough with traditional IVM techniques.

Big Data Frameworks support for dynamic query processing. Big data frameworks such as MapReduce [8] and Spark [25] are inherently batch-oriented. Early approaches for implementing dynamic query processing in these frameworks has focused on incremental processing of MapReduce tasks [6, 24, 22]. These are, however, based on traditional IVM techniques and suffer from high latency of MapReduce and its open source implementation Hadoop. More recent versions of distributed compute frameworks such as Apache Spark [26], Apache Flink [2], and Twitter Storm [23] / Heron [16] allow stream-based computations instead of batch-based computations. Out of the box, these frameworks mostly provide primitives for avoiding re-computation over sliding windows, based on traditional IVM. In addition, they present low-level programming primitives by which developers can express their own dynamic query processing algorithms. More recently, there are proposals to automatically incrementalize queries on distributed big data frameworks. Examples of these approaches include the distributed implementation of HIVM [19] and differential dataflow [18], as well as Spark Sructured Streaming [3].

2 Tutorial Structure

The tutorial runs for 3 hours and is divided into the following four parts:

**Part I: Introduction, desiderata, and traditional IVM

**We start the tutorial by giving an introduction to dynamic query processing and show examples that motivate the need for efficient incremental query processing. We give a high-level historical overview of traditional approaches (known as First Order IVM) that have been employed by conventional database systems to maintain query outputs. We present the strong and weak points of traditional approaches and then discuss new challenges introduced by streaming large data at high frequencies.

**Part II: Recent Algorithmic Advances in Dynamic Query Processing

**In the second part of the tutorial, we survey new efficient approaches and algorithms for dynamic query processing. We discuss the following research works: (1) Higher-Order IVM [15, 20, 14]; (2) Complexity lower bounds for dynamic query processing [4, 5, 13]; (3) Dynamic Yannakakis [11, 12]; (4) Factorized IVM [21]; (5) Space-time tradeoffs [13]; and (6) Beyond conjunctive queries: relations over application-dependent rings [9, 21, 14].

**Part III: Dynamic Query Processing in Big Data Frameworks

**Incremental processing of queries has been studied for queries executed by MapReduce [6, 24, 22] and by other distributed streaming platforms such as Spark Streaming [26, 3], Flink [2], and Storm [23]/Heron [16]. However, these systems rely on their users to specify how the queries are maintained or employ traditional incremental view maintenance approaches. Additionally, new parallel approaches that are executed in distributed environments [19] or that extend incremental processing [18] are introduced. We discuss all the mentioned approaches and platforms while highlighting the contributions that each one of them has made.

**Part IV: Outlook

**Finally we conclude by summarizing the existing research solutions and highlighting the open problems that are yet to be studied.

3 Acknowledgment

The authors are graciously supported by the Wiener-Anspach foundation. The work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 682588.

Bibliography26

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1]
2[2] Alexander Alexandrov et al . 2014. The Stratosphere Platform for Big Data Analytics. The VLDB Journal 23, 6 (Dec. 2014), 939–964.
3[3] Michael Armbrust et al . 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In Proc. ACM SIGMOD Int. Conf. on Management of Data . 601–613.
4[4] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2017. Answering Conjunctive Queries under Updates. In Proc. ACM SIGACT-SIGMOD-SIGART Symp. on Principles of Database Systems (PODS) . 303–318.
5[5] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. 2018. Answering UC Qs under Updates and in the Presence of Integrity Constraints. In Proc. ACM Int. Conf. on Database Theory (ICDT) . 8:1–8:19.
6[6] Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A Acar, and Rafael Pasquin. 2011. Incoop: Map Reduce for Incremental Computations. In Proc. ACM Symposium on Cloud Computing . 7:1–7:14.
7[7] Rada Chirkova and Jun Yang. 2012. Materialized Views. Found. & Trends in DB 4, 4 (2012), 295–405.
8[8] Jeffrey Dean and Sanjay Ghemawat. 2004. Map Reduce: Simplified Data Processing on Large Clusters. In Proc. USENIX Conf. on Operating Systems Design and Implementation (OSDI) . 137–150.