Workload-Aware Incremental Reclustering in Cloud Data Warehouses
Yipeng Liu, Renfei Zhou, Jiaqi Yan, Huanchen Zhang

TL;DR
This paper introduces WAIR, a workload-aware reclustering algorithm for cloud data warehouses. It reclusters only the boundary micro-partitions most critical to pruning efficiency, achieving near-optimal query performance at significantly lower reclustering cost.
Contribution
It proposes a novel separation of reclustering policy from clustering-key selection and introduces boundary micro-partitions for targeted reclustering in dynamic environments.
Findings
WAIR achieves near-optimal query performance relative to fully sorted table layouts.
WAIR significantly reduces reclustering costs.
In experiments, WAIR outperforms existing solutions.
Abstract
Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions that sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm to identify and recluster only boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal (with respect to fully sorted table layouts) query performance but incurs significantly lower reclustering cost with a theoretical upper…
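To make the abstract's terms concrete, the sketch below shows zonemap-based pruning and one possible reading of a "boundary micro-partition". It is an illustration only, not the WAIR algorithm: the names `MicroPartition`, `overlaps`, `is_boundary`, `scan_set`, and `boundary_set` are hypothetical, the single integer clustering key is an assumption, and treating a boundary micro-partition as one whose zonemap range straddles an endpoint of the query range is an assumed interpretation of "sit on the boundary of query ranges".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """A micro-partition described only by its zonemap (min/max of the key)."""
    name: str
    min_key: int
    max_key: int


def overlaps(p: MicroPartition, lo: int, hi: int) -> bool:
    """True if the zonemap range [min_key, max_key] intersects the query range [lo, hi]."""
    return p.min_key <= hi and p.max_key >= lo


def is_boundary(p: MicroPartition, lo: int, hi: int) -> bool:
    """Assumed definition: the partition overlaps the query range but is not
    fully contained in it, so it straddles one of the query's endpoints."""
    return overlaps(p, lo, hi) and (p.min_key < lo or p.max_key > hi)


def scan_set(parts: list[MicroPartition], lo: int, hi: int) -> list[str]:
    """Partitions that zonemap pruning cannot skip for the query [lo, hi]."""
    return [p.name for p in parts if overlaps(p, lo, hi)]


def boundary_set(parts: list[MicroPartition], lo: int, hi: int) -> list[str]:
    """Partitions that must be scanned but may hold rows outside [lo, hi]."""
    return [p.name for p in parts if is_boundary(p, lo, hi)]


# Hypothetical table layout: five micro-partitions over an integer key.
PARTS = [
    MicroPartition("A", 0, 9),
    MicroPartition("B", 10, 19),
    MicroPartition("E", 20, 25),
    MicroPartition("C", 24, 40),
    MicroPartition("D", 41, 60),
]

if __name__ == "__main__":
    print(scan_set(PARTS, 12, 30))      # ['B', 'E', 'C']
    print(boundary_set(PARTS, 12, 30))  # ['B', 'C']
```

For the query range [12, 30], partitions A and D are pruned by their zonemaps, E lies entirely inside the range, and B and C straddle its endpoints. Under the assumed definition, only B and C are scanned while possibly holding rows the query does not need, which is why reclustering such partitions could matter most for pruning.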
Taxonomy
Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Data Quality and Management
