Distributed Dependency Discovery

Hemant Saxena; Lukasz Golab; Ihab F. Ilyas

arXiv:1903.05228 · cs.DB · March 14, 2019 · 5 cites

TL;DR

This paper examines the problem of discovering data dependencies in distributed big data, introduces six primitives for analyzing and optimizing communication costs, and supports the analysis with experiments on real datasets.

Contribution

The paper introduces six primitives shared by existing dependency discovery algorithms, each corresponding to a data processing step separated by communication barriers, which support the design of communication-optimized implementations.

Findings

01

Primitive-based analysis reveals key communication bottlenecks.

02

Communication-optimized algorithms outperform naive approaches.

03

An experimental evaluation on real datasets supports the primitive-based analysis.

Abstract

We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.
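
As a rough illustration of the trade-off the abstract describes, the sketch below checks one candidate dependency over horizontally partitioned data. It assumes a functional dependency (for example, zip -> city) as the dependency type, which the abstract does not specify, and it is not one of the paper's six primitives. The names `local_summary` and `fd_holds_distributed` and the toy data are hypothetical. Each partition stands in for a shared-nothing worker that first collapses its rows locally, so only distinct value pairs cross the communication barrier instead of every row.

```python
from collections import defaultdict


def local_summary(partition, lhs, rhs):
    """Per-worker step: collapse rows to distinct (lhs values, rhs value) pairs."""
    return {(tuple(row[a] for a in lhs), row[rhs]) for row in partition}


def fd_holds_distributed(partitions, lhs, rhs):
    """Return (holds, pairs_shipped) for the candidate dependency lhs -> rhs.

    `pairs_shipped` counts the records that would be sent to the coordinator,
    a stand-in for communication cost in this toy model.
    """
    seen = defaultdict(set)
    shipped = 0
    for partition in partitions:
        summary = local_summary(partition, lhs, rhs)  # local computation
        shipped += len(summary)                       # simulated send to coordinator
        for key, value in summary:
            seen[key].add(value)
    # The dependency holds if every lhs value maps to exactly one rhs value.
    holds = all(len(values) == 1 for values in seen.values())
    return holds, shipped


# Hypothetical data: two partitions, each held by a different worker.
partitions = [
    [{"zip": "10001", "city": "NYC", "name": "a"},
     {"zip": "10001", "city": "NYC", "name": "b"}],
    [{"zip": "94105", "city": "SF", "name": "c"},
     {"zip": "10001", "city": "NYC", "name": "d"}],
]

zip_city = fd_holds_distributed(partitions, ["zip"], "city")
city_name = fd_holds_distributed(partitions, ["city"], "name")
total_rows = sum(len(p) for p in partitions)
```

With this data, checking zip -> city ships 3 distinct pairs rather than all 4 rows, while city -> name fails and yields no savings because every pair is distinct. The gap between the two cases hints at why communication cost depends on both the data and the candidate being checked.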

Taxonomy

Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries