Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL
Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui

TL;DR
Text2SQL-Flow is a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data, improving model performance and robustness on Text-to-SQL tasks.
Contribution
The paper introduces a SQL-aware data augmentation framework, Text2SQL-Flow, and a large-scale dataset built with it, SQLFlow, which supply high-quality, diverse data for Text-to-SQL model training and retrieval strategies.
Findings
Fine-tuning LLMs on SQLFlow improves their performance on Text-to-SQL tasks.
The retrieval method using SQLFlow outperforms existing approaches.
The framework enables scalable, high-fidelity data generation for Text-to-SQL tasks.
Abstract
The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance. To address these challenges, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. Our framework spans six augmentation dimensions and integrates an end-to-end pipeline with auxiliary database selection, SQL executability verification, natural language (NL) question generation, NL-SQL correspondence verification, and chain-of-thought (CoT) reasoning trace generation. Leveraging this framework, we construct SQLFlow, a high-quality…
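The SQL executability verification step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the target databases are SQLite, and the names `is_executable` and `build_demo_db`, as well as the toy `singer` schema, are hypothetical and introduced here only for illustration. The idea is simply to run each candidate query against the database and keep those that execute without error.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
        INSERT INTO singer VALUES (1, 'Ann', 30);
        INSERT INTO singer VALUES (2, 'Bob', 41);
        """
    )
    return conn


def is_executable(sql: str, conn: sqlite3.Connection) -> bool:
    """Return True if `sql` runs against `conn` without raising a database error.

    Assumption: candidates are read-only SELECT statements, so executing them
    does not modify the database.
    """
    try:
        cursor = conn.execute(sql)
        cursor.fetchmany(1)  # force at least one evaluation step of the query
        return True
    except (sqlite3.Error, sqlite3.Warning):
        # Syntax errors, unknown tables or columns, multi-statement input, etc.
        return False


if __name__ == "__main__":
    conn = build_demo_db()
    candidates = [
        "SELECT name FROM singer WHERE age > 35",  # valid
        "SELECT nme FROM singer",                  # unknown column
        "SELEC name FROM singer",                  # syntax error
    ]
    kept = [sql for sql in candidates if is_executable(sql, conn)]
    print(kept)
```

A check like this only confirms that a query runs; it says nothing about whether the query matches the natural language question, which the abstract treats as a separate NL-SQL correspondence verification step.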
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
