Parallel Evaluation of Multi-Semi-Joins
Jonny Daenen, Frank Neven, Tony Tan, Stijn Vansummeren

TL;DR
This paper introduces algorithms for parallel evaluation of complex semi-join queries in MapReduce, optimizing total computation time while maintaining low query response times, and demonstrates their effectiveness through experiments.
Contribution
The paper proposes a novel multi-semi-join (MSJ) MapReduce operator, which evaluates a set of semi-joins in a single job, and greedy algorithms for optimizing parallel query plans for SGF queries, a class that also covers queries with disjunction and negation.
Findings
Parallel query plans outperform sequential plans in net (wall-clock) time.
Additional optimizations reduce total time while keeping net time low.
Experimental results show advantages over Pig and Hive.
Abstract
While services such as Amazon AWS make computing power abundantly available, adding more computing nodes can incur high costs in, for instance, pay-as-you-go plans while not always significantly improving the net running time (aka wall-clock time) of queries. In this work, we provide algorithms for parallel evaluation of SGF queries in MapReduce that optimize total time, while retaining low net time. Not only can SGF queries specify all semi-join reducers, but also more expressive queries involving disjunction and negation. Since SGF queries can be seen as Boolean combinations of (potentially nested) semi-joins, we introduce a novel multi-semi-join (MSJ) MapReduce operator that enables the evaluation of a set of semi-joins in one job. We use this operator to obtain parallel query plans for SGF queries that outvalue sequential plans w.r.t. net time and provide additional optimizations…
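To make the idea of evaluating a set of semi-joins in one job concrete, here is a minimal in-memory sketch in Python. It is not the authors' implementation: it simulates a single map/shuffle/reduce round with ordinary dictionaries, and the function name `multi_semi_join`, the "request"/"assert" message labels, and the sample relations R, S, and T are all assumptions made for illustration.

```python
from collections import defaultdict


def multi_semi_join(guard, semi_joins):
    """Evaluate several semi-joins on one guard relation in a single
    simulated map/shuffle/reduce round (illustrative sketch only).

    guard:      list of tuples to be filtered
    semi_joins: dict mapping a name to (guard_key, cond_relation, cond_key),
                where the key functions extract the join value from a tuple
    Returns a dict mapping each name to the guard tuples that have at least
    one matching tuple in the corresponding conditional relation.
    """
    # Map phase: every tuple emits a message under the key
    # (semi-join name, join value); the dict plays the role of the shuffle.
    shuffled = defaultdict(list)
    for name, (guard_key, cond, cond_key) in semi_joins.items():
        for t in guard:
            shuffled[(name, guard_key(t))].append(("request", t))
        for s in cond:
            shuffled[(name, cond_key(s))].append(("assert", None))

    # Reduce phase: a guard tuple survives for a given semi-join only if
    # its key group also received an "assert" from the conditional relation.
    result = {name: [] for name in semi_joins}
    for (name, _value), messages in shuffled.items():
        if any(kind == "assert" for kind, _ in messages):
            result[name].extend(t for kind, t in messages if kind == "request")
    return result


# Hypothetical sample data: R is the guard, S and T are conditional relations.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1,), (3,)]
T = [("b",), ("c",)]

out = multi_semi_join(R, {
    "R_S": (lambda t: t[0], S, lambda s: s[0]),  # R semi-join S on first column
    "R_T": (lambda t: t[1], T, lambda s: s[0]),  # R semi-join T on second column
})

# A Boolean combination with negation: tuples of R matched by S but not by T.
only_s = [t for t in R if t in out["R_S"] and t not in out["R_T"]]
```

Both semi-joins are answered from the same pass over the data, and the final list comprehension shows how their results can then be combined with conjunction and negation, in the spirit of the Boolean combinations the abstract describes.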
Taxonomy
Topics: Data Management and Algorithms · Advanced Database Systems and Queries · Graph Theory and Algorithms
