Bootstrapping Learned Cost Models with Synthetic SQL Queries

Michael Nidd; Christoph Miksovic; Thomas Gschwind; Francesco Fusco; Andrea Giovannini; Ioana Giurgiu

arXiv:2508.19807·cs.DB·August 28, 2025

TL;DR

This paper applies synthetic SQL query generation, inspired by techniques from the generative AI and LLM community, to build training sets for learned cost models, reaching better predictive accuracy with fewer training queries.

Contribution

It describes the authors' experience using modern synthetic data generation techniques to create training datasets for learned cost models, reducing the number of queries needed for accurate cost predictions.

Findings

01 Improved cost model accuracy with 45% fewer training queries than competing generation approaches (initial results)

02 Synthetic data enhances training efficiency

03 Effective for diverse database workloads

Abstract

Having access to realistic workloads for a given database instance is extremely important to enable stress and vulnerability testing, as well as to optimize for cost and performance. Recent advances in learned cost models have shown that when enough diverse SQL queries are available, one can effectively and efficiently predict the cost of running a given query against a specific database engine. In this paper, we describe our experience in exploiting modern synthetic data generation techniques, inspired by the generative AI and LLM community, to create high-quality datasets enabling the effective training of such learned cost models. Initial results show that we can improve a learned cost model's predictive accuracy by training it with 45% fewer queries than when using competitive generation approaches.
