DynaHash: Efficient Data Rebalancing in Apache AsterixDB (Extended Version)
Chen Luo, Michael J. Carey

TL;DR
DynaHash is a novel data rebalancing method for shared-nothing systems that uses extendible hashing to achieve efficient, online, and low-cost data redistribution without blocking concurrent operations.
Contribution
The paper introduces DynaHash, which combines dynamic bucketing with extendible hashing, and demonstrates its implementation in Apache AsterixDB for scalable data rebalancing.
Findings
DynaHash achieves low rebalancing costs and good load balance.
The implementation in AsterixDB shows improved performance on TPC-H.
Rebalancing is performed online without blocking reads or writes.
Abstract
Parallel shared-nothing data management systems have been widely used to exploit a cluster of machines for efficient and scalable data processing. When a cluster needs to be dynamically scaled in or out, data must be efficiently rebalanced. Ideally, data rebalancing should have a low data movement cost, incur a small overhead on data ingestion and query processing, and be performed online without blocking reads or writes. However, existing parallel data management systems often exhibit certain limitations and drawbacks in terms of efficient data rebalancing. In this paper, we introduce DynaHash, an efficient data rebalancing approach that combines dynamic bucketing with extendible hashing for shared-nothing OLAP-style parallel data management systems. DynaHash dynamically partitions the records into a number of buckets using extendible hashing to achieve a good load balance with small…
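The general idea named in the abstract, partitioning records into buckets with extendible hashing and rebalancing by moving whole buckets, can be illustrated with a minimal sketch. This is not the DynaHash or AsterixDB implementation; the names `Bucket`, `ExtendibleHash`, `scale_out`, and the `capacity` parameter are hypothetical, and the greedy policy for choosing which buckets to move is an assumption made only for illustration.

```python
import hashlib


def h(key):
    # Stable 32-bit hash (assumption: any uniform hash would do here).
    return int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:4], "big")


class Bucket:
    def __init__(self, depth, prefix):
        self.depth = depth    # number of low-order hash bits this bucket owns
        self.prefix = prefix  # value of those bits
        self.keys = []


class ExtendibleHash:
    """Hypothetical sketch: buckets split one hash bit at a time when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buckets = [Bucket(0, 0)]  # a depth-0 bucket covers every hash

    def find(self, key):
        hv = h(key)
        for b in self.buckets:
            if hv & ((1 << b.depth) - 1) == b.prefix:
                return b
        raise AssertionError("buckets must cover the whole hash space")

    def split(self, b):
        # Only the overflowing bucket is rewritten; all others are untouched.
        bit = 1 << b.depth
        lo = Bucket(b.depth + 1, b.prefix)
        hi = Bucket(b.depth + 1, b.prefix | bit)
        for k in b.keys:
            (hi if h(k) & bit else lo).keys.append(k)
        self.buckets.remove(b)
        self.buckets += [lo, hi]

    def insert(self, key):
        b = self.find(key)
        b.keys.append(key)
        while len(b.keys) > self.capacity and b.depth < 32:
            self.split(b)
            b = self.find(key)


def load(buckets):
    return sum(len(b.keys) for b in buckets)


def scale_out(assignment, new_node):
    """Move whole buckets to new_node while that narrows the load gap.

    assignment maps node name -> list of buckets. Returns records moved.
    """
    assignment[new_node] = []
    moved = 0
    while True:
        donor = max(assignment, key=lambda n: load(assignment[n]))
        if donor == new_node or not assignment[donor]:
            break
        b = min(assignment[donor], key=lambda x: len(x.keys))
        if load(assignment[new_node]) + len(b.keys) >= load(assignment[donor]):
            break
        assignment[donor].remove(b)
        assignment[new_node].append(b)
        moved += len(b.keys)
    return moved


table = ExtendibleHash(capacity=64)
for i in range(1000):
    table.insert(f"record-{i}")

# Start with two nodes (round-robin bucket assignment), then add a third.
assignment = {"node0": table.buckets[0::2], "node1": table.buckets[1::2]}
moved = scale_out(assignment, "node2")
```

In this sketch, scaling out relocates only the buckets handed to the new node, so records on buckets that stay put are never rewritten. How the actual system chooses buckets and keeps reads and writes unblocked during the move is not covered here.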
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Caching and Content Delivery
