One Size Cannot Fit All: a Self-Adaptive Dispatcher for Skewed Hash Join in Shared-nothing RDBMSs
Jinxin Yang, Hui Li, Yiming Si, Hui Zhang, Kankan Zhao, Kewei Wei, Wenlong Song, Yingfan Liu, and Jiangtao Cui

TL;DR
This paper introduces a self-adaptive dispatcher for hash joins in shared-nothing RDBMSs that selects a join strategy at runtime based on data skew, instead of applying one fixed strategy to every workload, improving performance across skew scenarios.
Contribution
It proposes a novel self-adaptive hash join solution with a cost model that selects the best strategy at runtime, addressing skewness challenges in distributed RDBMSs.
Findings
Self-adaptive hash join outperforms fixed strategies in skewed data scenarios.
The cost model effectively guides strategy selection at runtime.
Implementation in KaiwuDB demonstrates significant performance improvements.
Abstract
Shared-nothing architecture has been widely adopted in various commercial distributed RDBMSs. Thanks to this architecture, queries can be processed in parallel and accelerated by scaling the cluster out horizontally on demand. Even so, load balancing remains a challenging issue in all distributed RDBMSs, including shared-nothing ones, which suffer heavily from skewed data distribution. In this work, we focus on one representative operator, namely Hash Join, and investigate how skewness among the nodes of a cluster affects the load balance and eventual efficiency of an arbitrary query in shared-nothing RDBMSs. We found that existing Distributed Hash Join (Dist-HJ) solutions may not provide satisfactory performance when a value is skewed in both the probe and build tables. To address that, we propose a novel Dist-HJ solution, namely Partition and Replication (PnR). Although…
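The load-balance problem the abstract describes can be illustrated with a small sketch. It is not the paper's PnR algorithm or cost model, only the general effect of hash-partitioning a skewed join key. The function names, the four-node cluster, the synthetic keys, and the use of `key % num_nodes` as a stand-in hash function are all assumptions made for illustration.

```python
from collections import Counter


def hash_partition_load(keys, num_nodes):
    """Count the rows each node receives under plain hash partitioning.

    Assumption: integer join keys, with `key % num_nodes` standing in
    for the system's real hash function.
    """
    load = Counter()
    for key in keys:
        load[key % num_nodes] += 1
    return [load[node] for node in range(num_nodes)]


def imbalance(load):
    """Busiest node's load divided by the mean load (1.0 means balanced)."""
    return max(load) / (sum(load) / len(load))


# Hypothetical data: 1000 distinct keys versus one hot key (7)
# that accounts for roughly half of the rows.
uniform_keys = list(range(1000))
skewed_keys = [7] * 500 + list(range(500))

uniform_load = hash_partition_load(uniform_keys, 4)
skewed_load = hash_partition_load(skewed_keys, 4)

print(uniform_load, imbalance(uniform_load))
print(skewed_load, imbalance(skewed_load))
```

Because every row carrying the hot key hashes to the same node, that node receives far more rows than its peers. A fixed partition-only strategy therefore stalls on the slowest node, which is the situation a skew-aware dispatcher is meant to avoid.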
Taxonomy
Topics: Peer-to-Peer Network Technologies · Cloud Computing and Resource Management · Caching and Content Delivery
