DynaHash: Efficient Data Rebalancing in Apache AsterixDB (Extended Version)

Chen Luo; Michael J. Carey

arXiv:2105.11075·cs.DB·May 25, 2021

TL;DR

DynaHash is a novel data rebalancing method for shared-nothing systems that uses extendible hashing to achieve efficient, online, and low-cost data redistribution without blocking concurrent operations.

Contribution

It introduces DynaHash, combining dynamic bucketing with extendible hashing, and demonstrates its effective implementation in Apache AsterixDB for scalable data rebalancing.

Findings

01. DynaHash achieves low rebalancing costs and good load balance.

02. The implementation in AsterixDB shows improved performance on TPC-H.

03. Rebalancing is performed online without blocking reads or writes.

Abstract

Parallel shared-nothing data management systems have been widely used to exploit a cluster of machines for efficient and scalable data processing. When a cluster needs to be dynamically scaled in or out, data must be efficiently rebalanced. Ideally, data rebalancing should have a low data movement cost, incur a small overhead on data ingestion and query processing, and be performed online without blocking reads or writes. However, existing parallel data management systems often exhibit limitations in efficient data rebalancing. In this paper, we introduce DynaHash, an efficient data rebalancing approach that combines dynamic bucketing with extendible hashing for shared-nothing OLAP-style parallel data management systems. DynaHash dynamically partitions the records into a number of buckets using extendible hashing to achieve a good load balance with small…
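For readers unfamiliar with extendible hashing, the following is a minimal textbook-style sketch in Python, not the paper's or AsterixDB's implementation. It shows how a bucket that grows past a capacity threshold is split by looking at one more bit of the hash, so only the records of that one bucket move. All names here (`ExtendibleHash`, `Bucket`, `stable_hash`, `capacity`) are hypothetical, and the assignment of buckets to cluster nodes that a rebalancer would need is deliberately left out.

```python
import hashlib

MAX_DEPTH = 32  # stable_hash() yields 32 bits, so never split deeper than this


def stable_hash(key):
    """Deterministic 32-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Bucket:
    def __init__(self, depth, bits):
        self.depth = depth   # local depth: number of low-order hash bits shared
        self.bits = bits     # the shared low-order bit pattern
        self.records = {}


class ExtendibleHash:
    def __init__(self, capacity=4):
        self.capacity = capacity          # illustrative split threshold
        self.global_depth = 0
        self.directory = [Bucket(0, 0)]   # 2**global_depth slots

    def _bucket_for(self, key):
        index = stable_hash(key) & ((1 << self.global_depth) - 1)
        return self.directory[index]

    def get(self, key):
        return self._bucket_for(key).records.get(key)

    def put(self, key, value):
        bucket = self._bucket_for(key)
        bucket.records[key] = value
        # Split until the bucket holding the key fits (or bits run out).
        while len(bucket.records) > self.capacity and bucket.depth < MAX_DEPTH:
            self._split(bucket)
            bucket = self._bucket_for(key)

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Double the directory; slot i + 2**d initially aliases slot i.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        new_depth = bucket.depth + 1
        low = Bucket(new_depth, bucket.bits)
        high = Bucket(new_depth, bucket.bits | (1 << bucket.depth))
        # Redistribute only this bucket's records using one extra hash bit.
        for k, v in bucket.records.items():
            target = high if (stable_hash(k) >> bucket.depth) & 1 else low
            target.records[k] = v
        # Repoint every directory slot that referenced the old bucket.
        for i, slot in enumerate(self.directory):
            if slot is bucket:
                self.directory[i] = high if (i >> bucket.depth) & 1 else low

    def buckets(self):
        """Distinct buckets (several directory slots may share one bucket)."""
        return list({id(b): b for b in self.directory}.values())


table = ExtendibleHash(capacity=4)
for i in range(200):
    table.put(f"key-{i}", i)
```

The property this sketch is meant to illustrate is that a split touches a single bucket rather than rehashing the whole dataset, which is why bucket-level splitting is attractive when data movement cost matters.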

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Caching and Content Delivery