EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis
Xuanguang Pan, Chongyang Tao, Jiayuan Bai, Jianling Gao, Zhengwei Tao, Xiansheng Zhou, Gavin Cheung, Shuai Ma

TL;DR
EvolSQL is a structure-aware data synthesis framework that enhances Text-to-SQL training data by evolving SQL queries into more complex and diverse forms, improving model performance with less data.
Contribution
We introduce EvolSQL, a novel evolution-based data synthesis method that systematically increases SQL query complexity and diversity for better Text-to-SQL model training.
Findings
A 7B model trained on EvolSQL data outperforms one trained on larger datasets.
EvolSQL generates high-quality, structurally diverse SQL query pairs.
Our approach reduces data requirements while improving model performance.
Abstract
Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting…
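As a rough illustration of what an atomic, structure-level transformation could look like, the sketch below evolves a seed query in two steps along the predicate and aggregation dimensions. Everything in it is an assumption made for illustration and is not taken from the paper: the `Query` dataclass is a toy stand-in for a SQL AST, `add_predicate` and `add_aggregation` are hypothetical operators, and the SQLite execution check is one plausible way to filter invalid evolved queries.

```python
import sqlite3
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Query:
    """Toy stand-in for a SQL AST (hypothetical, not the paper's representation)."""
    select: tuple
    table: str
    where: tuple = ()
    group_by: tuple = ()


def render(q: Query) -> str:
    """Serialize the toy AST back into a SQL string."""
    sql = f"SELECT {', '.join(q.select)} FROM {q.table}"
    if q.where:
        sql += " WHERE " + " AND ".join(q.where)
    if q.group_by:
        sql += " GROUP BY " + ", ".join(q.group_by)
    return sql


def add_predicate(q: Query, condition: str) -> Query:
    """Hypothetical predicate operator: append one filter condition."""
    return replace(q, where=q.where + (condition,))


def add_aggregation(q: Query, key: str, agg: str) -> Query:
    """Hypothetical aggregation operator: group by `key` and select `agg`."""
    return replace(q, select=(key, agg), group_by=(key,))


def is_executable(conn: sqlite3.Connection, sql: str) -> bool:
    """Assumed validity filter: keep a query only if the database runs it."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


# Small in-memory database to evolve queries against (made-up schema and rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ann", 10.0), (2, "bob", 25.0), (3, "ann", 40.0)],
)

seed = Query(select=("id",), table="orders")
step1 = add_predicate(seed, "amount > 15")
step2 = add_aggregation(step1, "customer", "SUM(amount) AS total")

for q in (seed, step1, step2):
    print(render(q), "| executable:", is_executable(conn, render(q)))
```

Each operator returns a new `Query` rather than mutating its input, so every intermediate step remains available as a training example of its own. A real pipeline would work on a full SQL parse tree and would also need to produce a matching natural-language question for each evolved query.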
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Database Systems and Queries · Scientific Computing and Data Management
