Pre-Execution Query Slot-Time Prediction in Cloud Data Warehouses: A Feature-Scoped Machine Learning Approach
Prashant Kumar Pathak

TL;DR
This paper introduces a machine learning model to predict BigQuery slot-time before execution using only pre-execution signals, improving cost estimation accuracy in cloud data warehouses.
Contribution
It presents a feature-scoped ML approach with a dual-model architecture that outperforms simple baselines on cost-significant queries.
Findings
The model achieves 74% explained variance on the full workload.
On cost-significant queries, MAE is reduced by 30-37% relative to simple baselines.
Long-tail queries remain challenging due to unobserved runtime factors.
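For readers unfamiliar with the two metrics quoted in the findings, the sketch below shows how explained variance and a relative MAE reduction are commonly computed. The numbers are toy values chosen for illustration and are not taken from the paper; the function names are hypothetical, and the paper's exact evaluation protocol (for example, whether metrics are computed in log space) is not stated in this summary.

```python
from statistics import fmean, pvariance


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(targets), using population variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


def relative_mae_reduction(baseline_mae, model_mae):
    """Fraction by which the model's MAE is lower than the baseline's."""
    return (baseline_mae - model_mae) / baseline_mae


# Toy slot-time values (arbitrary units), NOT data from the paper.
y_true = [10, 20, 30, 40]
baseline_pred = [25, 25, 25, 25]   # a constant "simple baseline"
model_pred = [12, 18, 33, 37]      # a hypothetical model's predictions

reduction = relative_mae_reduction(mae(y_true, baseline_pred),
                                   mae(y_true, model_pred))
ev = explained_variance(y_true, model_pred)
print(f"MAE reduction: {reduction:.1%}, explained variance: {ev:.3f}")
```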
Abstract
Cloud data warehouses bill compute based on slot-time consumed. In shared multi-tenant environments, query cost is highly variable and hard to estimate before execution, causing budget overruns and degraded scheduling. Static query-planner heuristics fail to capture complex SQL structure, data skew, and workload contention. We present a feature-scoped machine learning approach that predicts BigQuery slot-time before execution using only pre-execution observable signals: a structured query complexity score derived from SQL operator costs, data volume features from planner estimates and workload metadata, and textual features from query text. We deliberately exclude runtime factors (slot-pool utilization, cache state, realized skew) unknowable at submission. The model uses a HistGradientBoostingRegressor trained on log-transformed slot-time, with a TF-IDF + TruncatedSVD-512 text pipeline…
