SynQL: A Controllable and Scalable Rule-Based Framework for SQL Workload Synthesis for Performance Benchmarking
Kahan Mehta, Amit Mankodi

TL;DR
SynQL is a deterministic framework that generates diverse, realistic SQL workloads by traversing database schemas, aiding performance benchmarking and training data generation without relying on real logs.
Contribution
It introduces a schema-aware, controllable workload synthesis method that overcomes limitations of existing tools and LLMs, focusing on core SQL fragments.
Findings
SynQL produces highly diverse workloads with topological entropy of 1.53 bits.
Cost models trained on SynQL data achieve R^2 ≥ 0.79 on synthetic test sets.
SynQL generates execution-ready SQL workloads with schema and syntactic validity.
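The entropy figure above can be read as a measure of how evenly a workload spreads across join shapes. The summary does not define "topological entropy", so the sketch below assumes it is the Shannon entropy (in bits) of the distribution of join-topology classes. The class labels and the helper name `shannon_entropy_bits` are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy_bits(labels):
    """Shannon entropy, in bits, of the empirical distribution of labels.

    Assumption: each query has been assigned one topology label
    (e.g. "chain", "star"); how the paper assigns labels is not stated.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical workload: one topology label per generated query.
workload = ["chain", "star", "chain", "single", "star", "cycle", "chain", "single"]
print(f"{shannon_entropy_bits(workload):.2f} bits")
```

Under this reading, a workload in which every query has the same join shape scores 0 bits, and k equally likely shapes score log2(k) bits.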
Abstract
Database research and the development of learned query optimisers rely heavily on realistic SQL workloads. Acquiring real-world queries is increasingly difficult, however, due to strict privacy regulations, and publicly released anonymised traces typically strip out executable query text to preserve confidentiality. Existing synthesis tools fail to bridge this training data gap: traditional benchmarks offer too few fixed templates for statistical generalisation, while Large Language Model (LLM) approaches suffer from schema hallucination (fabricating non-existent columns) and topological collapse (systematically defaulting to simplistic join patterns that fail to stress-test query optimisers). We propose SynQL, a deterministic workload synthesis framework that generates structurally diverse, execution-ready SQL workloads. As a foundational step toward bridging the training-data gap, SynQL…
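To make the idea of deterministic, schema-aware synthesis concrete, here is a minimal sketch of generating join queries by walking foreign-key edges of a schema graph. It is an illustration under stated assumptions, not SynQL's actual algorithm: the toy schema, the names `SCHEMA`, `FOREIGN_KEYS`, `neighbours` and `synthesize_query`, and the seeded random walk are all hypothetical.

```python
import random

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "customers": ["id", "name"],
    "orders": ["id", "customer_id", "total"],
    "items": ["id", "order_id", "price"],
}
# Hypothetical foreign keys: (child table, child column, parent table, parent column).
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("items", "order_id", "orders", "id"),
]


def neighbours(table):
    """Yield (other_table, join_condition) for each FK edge touching table."""
    for child, child_col, parent, parent_col in FOREIGN_KEYS:
        condition = f"{child}.{child_col} = {parent}.{parent_col}"
        if child == table:
            yield parent, condition
        elif parent == table:
            yield child, condition


def synthesize_query(seed, max_joins=2):
    """Build one SELECT by a seeded walk over FK edges.

    Only tables and columns present in SCHEMA are ever emitted, and the
    same seed always yields the same query text.
    """
    rng = random.Random(seed)
    start = rng.choice(sorted(SCHEMA))
    joined = [start]
    clauses = []
    for _ in range(max_joins):
        # Candidate tables reachable from anything already joined.
        frontier = [
            (t, cond) for j in joined for t, cond in neighbours(j) if t not in joined
        ]
        if not frontier:
            break
        table, condition = rng.choice(frontier)
        joined.append(table)
        clauses.append(f"JOIN {table} ON {condition}")
    columns = [f"{t}.{rng.choice(SCHEMA[t])}" for t in joined]
    return f"SELECT {', '.join(columns)} FROM {start} {' '.join(clauses)}".strip()


print(synthesize_query(seed=7))
```

Because every identifier is drawn from the schema itself, a generator of this kind cannot reference a column that does not exist, and fixing the seed makes the workload reproducible.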
