In-database connected component analysis
Harald Bögeholz, Michael Brand, Radu-Alexandru Todor

TL;DR
This paper presents a practical, SQL-implementable algorithm for computing connected components of large graphs stored in a massively parallel relational database. The algorithm always returns the correct answer, and with high probability it needs only a logarithmic number of SQL queries.
Contribution
It introduces a linear-space, randomized algorithm for connected components that always terminates with the correct answer and, with high probability, does so after O(log |V|) SQL queries in a parallel database setting; quasi-linear runtime is observed empirically.
Findings
Algorithm terminates after O(log |V|) SQL queries with high probability
Empirical results show quasi-linear runtime performance
Algorithm is suitable for large-scale graph data in MPP relational databases
Abstract
We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components for graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm described is a linear-space, randomised algorithm, always terminating with the correct answer but subject to a stochastic running time, such that for any ε > 0 and any input graph G = ⟨V, E⟩ the algorithm terminates after O(log |V|) SQL queries with probability of at least 1 − ε, which we show empirically to translate to a quasi-linear runtime in practice.
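To make "SQL-implementable" concrete, the following is a minimal sketch of connected component labelling driven entirely by SQL queries. It is not the paper's randomised algorithm: it is plain minimum-label propagation, which can need a number of rounds proportional to the graph diameter rather than O(log |V|). SQLite (via Python's `sqlite3`) stands in for an MPP database, and the function name `connected_components` and the table names `edges`, `labels`, and `new_labels` are hypothetical, chosen only for this illustration.

```python
import sqlite3


def connected_components(edges):
    """Map each vertex to the smallest vertex id in its component.

    Naive min-label propagation, one SQL round per iteration.
    Illustrative baseline only; NOT the paper's randomised algorithm.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
    # Store every undirected edge in both directions.
    con.executemany(
        "INSERT INTO edges VALUES (?, ?)",
        [(a, b) for a, b in edges] + [(b, a) for a, b in edges],
    )
    # Every vertex starts as its own representative.
    con.execute(
        "CREATE TABLE labels AS SELECT DISTINCT a AS v, a AS rep FROM edges"
    )
    # Self-loops let a vertex count itself among its neighbours.
    con.execute("INSERT INTO edges SELECT v, v FROM labels")

    while True:
        # One round: each vertex adopts the smallest label in its
        # closed neighbourhood.
        con.execute(
            """
            CREATE TABLE new_labels AS
            SELECT e.a AS v, MIN(l.rep) AS rep
            FROM edges e JOIN labels l ON l.v = e.b
            GROUP BY e.a
            """
        )
        changed = con.execute(
            """
            SELECT COUNT(*)
            FROM labels l JOIN new_labels n ON n.v = l.v
            WHERE n.rep <> l.rep
            """
        ).fetchone()[0]
        con.execute("DROP TABLE labels")
        con.execute("ALTER TABLE new_labels RENAME TO labels")
        if changed == 0:
            break

    result = dict(con.execute("SELECT v, rep FROM labels").fetchall())
    con.close()
    return result


if __name__ == "__main__":
    print(connected_components([(1, 2), (2, 3), (4, 5)]))
```

On a long path graph this baseline needs roughly as many rounds as the path has vertices, which is the kind of behaviour that a bound of O(log |V|) queries avoids.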
