Workload-Aware Incremental Reclustering in Cloud Data Warehouses

Yipeng Liu; Renfei Zhou; Jiaqi Yan; Huanchen Zhang

arXiv:2602.23289·cs.DB·March 18, 2026

TL;DR

This paper introduces WAIR, a workload-aware reclustering algorithm for cloud data warehouses. WAIR reclusters only the boundary micro-partitions most critical for pruning, achieving near-optimal query performance at significantly lower reclustering cost.

Contribution

The paper advocates a clean separation between reclustering policy and clustering-key selection, and introduces boundary micro-partitions (those that sit on the boundary of query ranges) as the target of reclustering under continuous data ingestion and evolving workloads.

Findings

1. WAIR achieves near-optimal query performance relative to fully sorted table layouts.
2. WAIR incurs significantly lower reclustering cost.
3. In experiments, WAIR outperforms existing solutions.

Abstract

Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) for efficient data pruning during query processing. Maintaining data clustering in a large-scale table is crucial for effective data pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions that sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm to identify and recluster only boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal (with respect to fully sorted table layouts) query performance but incurs significantly lower reclustering cost with a theoretical upper…
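
The pruning mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the WAIR algorithm: the `MicroPartition` class, the helper names, and the sample data are hypothetical, and the sketch assumes that a boundary micro-partition is one whose zonemap range straddles an endpoint of a query range, which is only one reading of the abstract's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical micro-partition described only by its zonemap."""
    pid: int
    min_key: int  # smallest clustering-key value stored in the partition
    max_key: int  # largest clustering-key value stored in the partition


def overlaps(p: MicroPartition, lo: int, hi: int) -> bool:
    """Zonemap test: could p hold a key in the closed range [lo, hi]?"""
    return p.min_key <= hi and p.max_key >= lo


def prune(partitions, lo, hi):
    """Keep only the partitions a range query on [lo, hi] must scan."""
    return [p for p in partitions if overlaps(p, lo, hi)]


def is_boundary(p: MicroPartition, lo: int, hi: int) -> bool:
    """Assumed definition: p overlaps [lo, hi] but is not contained in it."""
    return overlaps(p, lo, hi) and (p.min_key < lo or p.max_key > hi)


# Five perfectly clustered partitions, each covering ten key values.
partitions = [MicroPartition(i, 10 * i, 10 * i + 9) for i in range(5)]

lo, hi = 12, 37  # hypothetical range predicate on the clustering key
scanned = prune(partitions, lo, hi)
boundary = [p.pid for p in scanned if is_boundary(p, lo, hi)]

print([p.pid for p in scanned])  # [1, 2, 3]
print(boundary)                  # [1, 3]
```

Partitions 0 and 4 are pruned by their zonemaps alone. Partitions 1 and 3 are scanned even though only part of their key range matches the query; under the assumption above, these are the partitions a workload-aware policy would consider for reclustering.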

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Database Systems and Queries · Data Quality and Management