Runtime Optimization of Join Location in Parallel Data Management Systems

Bikash Chandra; S. Sudarshan

arXiv:1703.01148·cs.DB·August 1, 2017·2 cites

Open Access

TL;DR

This paper introduces runtime techniques for optimizing join location decisions in parallel data systems, balancing data transfer, skew, and UDF computation to enhance throughput.

Contribution

It presents an extended ski-rental based algorithm for per-key join location decisions with performance guarantees, implemented on Hadoop, Spark, and Muppet.

Findings

01. Significant throughput improvements over existing methods.
02. Effective load balancing considering CPU, network, and I/O.
03. Robust performance guarantees in various scenarios.

Abstract

Applications running on parallel systems often need to join a streaming or stored relation with data indexed in a parallel data storage system. Some applications also compute user-defined functions (UDFs) on the joined tuples. The join can be performed at the data storage nodes, which corresponds to a reduce-side join, or by fetching data from the storage system to the compute nodes, which corresponds to a map-side join. Both may be suboptimal: reduce-side joins may cause skew, while map-side joins may cause large amounts of data to be transferred and replicated. In this paper, we present techniques to make runtime decisions between the two options on a per-key basis, in order to improve the throughput of the join, accounting for UDF computation if any. Our techniques are based on an extended ski-rental algorithm and provide worst-case performance guarantees with respect to the optimal point in the space considered by us.…
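
To make the per-key choice concrete, the following is a minimal sketch of the classic (unextended) ski-rental break-even rule applied to join location. It is not the paper's extended algorithm: it ignores skew, node load, and UDF cost. The class name `SkiRentalJoinRouter` and the cost parameters `remote_cost` and `fetch_cost` are hypothetical, introduced only for illustration. Sending a tuple to the storage node plays the role of "renting", and fetching the stored data to the compute node plays the role of "buying".

```python
from collections import defaultdict


class SkiRentalJoinRouter:
    """Classic break-even ski-rental rule, applied independently per join key."""

    def __init__(self, remote_cost: float, fetch_cost: float):
        # remote_cost: assumed cost of one join at the storage node ("rent").
        # fetch_cost: assumed one-time cost of fetching the stored data
        #             to the compute node ("buy").
        self.remote_cost = remote_cost
        self.fetch_cost = fetch_cost
        self.spent = defaultdict(float)  # cumulative "rent" paid per key
        self.cached = set()              # keys whose data is held locally

    def route(self, key) -> str:
        """Return where this tuple's join runs: 'remote', 'fetch', or 'local'."""
        if key in self.cached:
            return "local"
        # Keep renting while total rent stays within the purchase price;
        # buy on the first request that would push rent past it.
        if self.spent[key] + self.remote_cost > self.fetch_cost:
            self.cached.add(key)
            return "fetch"
        self.spent[key] += self.remote_cost
        return "remote"


router = SkiRentalJoinRouter(remote_cost=1.0, fetch_cost=3.0)
decisions = [router.route("k") for _ in range(6)]
print(decisions)
```

Under these assumed costs, a key is joined remotely until the rent paid reaches the fetch cost, after which its data is fetched once and later joins run locally. The total cost per key is then at most twice that of the best fixed choice made in hindsight, which is the standard ski-rental guarantee.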

Taxonomy

Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Caching and Content Delivery