In-database connected component analysis

Harald Bögeholz; Michael Brand; Radu-Alexandru Todor

arXiv:1802.09478·cs.DS·October 18, 2019


TL;DR

This paper presents a practical, SQL-implementable randomised algorithm for computing the connected components of large graphs stored in a massively parallel relational database. The algorithm always returns the correct answer; only its running time is random.

Contribution

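It introduces a linear-space, randomised algorithm for connected components that always terminates with the correct answer and, for any ε > 0, finishes within O(log |V|) SQL queries with probability at least 1 - ε. Experiments indicate that this translates to quasi-linear runtime in practice.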

Findings

01. Algorithm terminates after O(log |V|) SQL queries with high probability

02. Empirical results show quasi-linear runtime performance

03. Algorithm is suitable for large-scale graph data in MPP relational databases

Abstract

We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components for graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm described is a linear-space, randomised algorithm, always terminating with the correct answer but subject to a stochastic running time, such that for any $\epsilon > 0$ and any input graph $G = \langle V, E \rangle$ the algorithm terminates after $O(\log |V|)$ SQL queries with probability of at least $1 - \epsilon$, which we show empirically to translate to a quasi-linear runtime in practice.
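To make the "one SQL query per round" setting concrete, here is a minimal sketch of connected component labelling driven entirely by SQL. It is not the paper's randomised algorithm: it is the naive minimum-label propagation baseline, which can need a number of rounds proportional to the graph diameter rather than $O(\log |V|)$. It uses Python's built-in `sqlite3` as a stand-in for an MPP database, and the table names `edge`, `label` and `next_label` and the function `connected_components` are hypothetical, chosen only for illustration.

```python
import sqlite3


def connected_components(edges):
    """Label each vertex with the smallest vertex id in its component.

    Naive min-label propagation (an illustrative baseline, NOT the paper's
    randomised algorithm). Vertices that appear in no edge are not represented.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (u INTEGER, v INTEGER)")
    # Store every undirected edge in both directions.
    rows = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    con.executemany("INSERT INTO edge VALUES (?, ?)", rows)
    # Initially every vertex is its own component label.
    con.execute(
        "CREATE TABLE label AS SELECT DISTINCT u AS v, u AS comp FROM edge"
    )

    while True:
        # One round: each vertex takes the minimum label among
        # itself and all of its neighbours.
        con.execute("""
            CREATE TABLE next_label AS
            SELECT v, MIN(comp) AS comp FROM (
                SELECT v, comp FROM label
                UNION ALL
                SELECT e.u AS v, l.comp AS comp
                FROM edge AS e JOIN label AS l ON l.v = e.v
            ) GROUP BY v
        """)
        # Count vertices whose label changed in this round.
        changed = con.execute("""
            SELECT COUNT(*)
            FROM label AS l JOIN next_label AS n ON n.v = l.v
            WHERE n.comp <> l.comp
        """).fetchall()[0][0]
        con.execute("DROP TABLE label")
        con.execute("ALTER TABLE next_label RENAME TO label")
        if changed == 0:
            break  # Fixed point reached: labels are final.

    result = dict(con.execute("SELECT v, comp FROM label").fetchall())
    con.close()
    return result


if __name__ == "__main__":
    print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

On a long path graph this baseline needs roughly as many rounds as there are vertices, which is the kind of behaviour a logarithmic bound on the number of SQL queries rules out.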
