Distributed Dependency Discovery
Hemant Saxena, Lukasz Golab, Ihab F. Ilyas

TL;DR
This paper examines the challenges of discovering data dependencies in distributed big data environments, introduces six primitives for analyzing and optimizing communication costs, and validates the approach with experiments on real datasets.
Contribution
It introduces six primitives for analyzing distributed dependency discovery algorithms, enabling the design of communication-efficient implementations.
Findings
Primitive-based analysis reveals key communication bottlenecks.
Communication-optimized algorithms outperform naive approaches.
Experimental results on real datasets confirm the effectiveness of the primitives.
Abstract
We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets.
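To make the communication concern concrete, here is a minimal sketch of checking a single candidate functional dependency (LHS -> RHS) over horizontally partitioned data. It is an illustration only and not the authors' method: the six primitives are not listed in this summary, and the function names `local_summary` and `fd_holds`, the row-as-dict layout, and the "records shipped" counter are all assumptions made for this example. Each node compresses its partition into distinct (LHS, RHS) pairs before anything crosses the communication barrier, and the coordinator merges those summaries.

```python
from collections import defaultdict


def local_summary(partition, lhs, rhs):
    """Per-node step (hypothetical): map each LHS value to the RHS values seen locally."""
    summary = defaultdict(set)
    for row in partition:
        key = tuple(row[a] for a in lhs)
        summary[key].add(tuple(row[a] for a in rhs))
    return summary


def fd_holds(partitions, lhs, rhs):
    """Coordinator step (hypothetical): merge local summaries and test LHS -> RHS.

    Returns (holds, shipped), where shipped counts the distinct (LHS, RHS)
    pairs sent by all nodes, a rough stand-in for communication cost.
    """
    merged = defaultdict(set)
    shipped = 0
    for partition in partitions:
        # Only the deduplicated summary leaves the node, not the raw rows.
        for key, values in local_summary(partition, lhs, rhs).items():
            shipped += len(values)
            merged[key] |= values
    # The dependency holds if every LHS value maps to exactly one RHS value.
    holds = all(len(values) == 1 for values in merged.values())
    return holds, shipped
```

In this sketch a dependency can hold inside every partition and still fail globally, which is why some exchange between nodes is unavoidable; sending deduplicated summaries instead of raw rows is one simple way to shrink that exchange.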
Taxonomy
Topics: Data Quality and Management · Data Mining Algorithms and Applications · Advanced Database Systems and Queries
