SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Cornelius Wolff, Daniel Gomm, Madelon Hulsebos

TL;DR
SQaLe is a large, semi-synthetic text-to-SQL dataset grounded in real schemas, designed to improve model generalization by capturing realistic schema and query complexity, diversity, and natural language ambiguity.
Contribution
We introduce SQaLe, a large-scale semi-synthetic dataset with 517,676 high-quality (question, schema, query) triples, built from real-world schemas to advance text-to-SQL research.
Findings
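The authors present SQaLe as the most realistic large-scale text-to-SQL dataset to date, grounded in 135,875 schemas expanded from the real-world SchemaPile collection.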
It captures schema variability, diverse query patterns, and natural language ambiguity.
The dataset is designed to improve model generalization in text-to-SQL tasks.
Abstract
Advances in large language models have accelerated progress in text-to-SQL: methods for converting natural language questions into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline that combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and…
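To make "execution validity" concrete, here is a minimal sketch of checking that one (question, schema, query) triple runs. The abstract does not say which SQL dialect or validation procedure the authors used, so SQLite is an assumption here, and the names Triple and is_executable are hypothetical, introduced only for illustration.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """One (question, schema, query) record; field names are illustrative."""
    question: str
    schema: str  # one or more CREATE TABLE statements
    query: str   # the SQL query that should answer the question


def is_executable(triple: Triple) -> bool:
    """Return True if the query runs against an empty database built from the schema.

    Assumption: SQLite stands in for whatever engine the dataset targets.
    This checks only that the query executes, not that it answers the question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(triple.schema)          # create the tables
        conn.execute(triple.query).fetchall()      # run the query on empty tables
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id)
);
"""

valid = Triple(
    question="How many orders did each customer place?",
    schema=SCHEMA,
    query=(
        "SELECT c.name, COUNT(o.id) FROM customers c "
        "LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name"
    ),
)

# References a column that the schema does not define.
invalid = Triple(
    question="What is the total of each order?",
    schema=SCHEMA,
    query="SELECT total FROM orders",
)

valid_ok = is_executable(valid)
invalid_ok = is_executable(invalid)
```

A check of this kind only filters out queries that fail to parse or that reference missing tables and columns; whether a query matches the intent of its question would need a separate check.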
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Digital Humanities and Scholarship
